Data Integrator DM370R2 Learner Guide GA


BusinessObjects Data Integrator XI R1/R2: Extracting, Transforming, and Loading Data

Learner's Guide
DM370R2


Copyright

Patents

Business Objects owns the following U.S. patents, which may cover products that are offered and sold by Business Objects: 5,555,403, 6,247,008 B1, 6,578,027 B2, 6,490,593 and 6,289,352.

Trademarks

Business Objects, the Business Objects logo, Crystal Reports, and Crystal Enterprise are trademarks or registered trademarks of Business Objects SA or its affiliated companies in the United States and other countries. All other names mentioned herein may be trademarks of their respective owners. Product specifications and program conditions are subject to change without notice.

Copyright

Copyright © 2005 Business Objects SA. All rights reserved.


CONTENTS

About this Course
  Course objectives
  Course audience
  Prerequisite education
  Prerequisite knowledge/experience
  Course success factors
  Course materials
  Learning process
  Applicable certification

Computer Setup
  Introduction
  Computer Setup
    Hardware
    Software
    Testing the hardware and software setup
  Setting up for the activities
  Getting Help

Lesson 1: Data Warehousing Concepts
  Understanding Data Warehousing Concepts
    Normal forms of data
    Dimensional modelling
    Components that make up dimension tables
    Types of Slowly Changing Dimensions (SCD)
  Lesson Summary
    Quiz: Data Warehousing Concepts

Lesson 2: Understanding Data Integrator
  Understanding Data Integrator benefits and functions
    Data Integrator benefits
    Single point of integration
    Key Data Integrator performance functions
    Loading data
    Applying transactions
  Understanding Data Integrator architecture
    Data Integrator Designer
    Data Integrator repository
    Data Integrator Service
    Data Integrator Web Server
    Repository Manager
    Server Manager
  Understanding Data Integrator objects
    Single-use objects
    Reusable objects
    Projects and jobs
    Object hierarchy
  Using the Data Integrator Designer interface
    Key areas of the Designer window
    Toolbar
    Local object library
    Project area
    Tool palette
    Workspace
  Lesson Summary
    Quiz: Understanding Data Integrator

Lesson 3: Defining Source and Target Metadata
  Using datastores
    Explaining what a datastore is
    Creating a database datastore
    Changing a datastore definition
  Importing metadata
    Listing types of metadata
    Capturing metadata information from imported data
    Imported table information
    Imported stored function and procedure information
    Importing metadata by browsing
    Activity: Creating an operational datastore (ODS) and importing table metadata for the source table
    Activity: Creating a target datastore and importing metadata for target tables
  Defining a file format
    Explaining what a file format is
    Creating file formats
    Property Values
    Handling errors in file formats
    Activity: Creating a file format
  Lesson Summary
    Quiz: Defining Source and Target Metadata

Lesson 4: Creating a Batch Job
  Creating a batch job
    Creating a project
    Objects that make up a project
    Creating a job
    Naming conventions for objects in jobs
    Adding, connecting, and deleting objects in the workspace
    Creating a work flow
    Steps in a work flow
    Order of execution in work flows
    Example of a work flow
  Creating a simple data flow
    Creating a data flow
    Naming data flows
    Steps in a data flow
    Data flows as steps in work flows
    Intermediate data sets in a data flow
    Data flow example
    Explaining source and target objects
    Target objects
    Prerequisites for using source or target objects in a data flow
    Adding source and target objects to a data flow
  Using the Query transform
    Explaining what a transform is
    Understanding the Query transform
    Describing the Query editor window
    Using the Query transform in a data flow
  Understanding the Target Table editor
    Setting the database properties in the Target tab
    Describing table loading options
    Options tab
  Executing the job
    Understanding job execution
    Executing the job
    Activity: Defining a data flow to load data into a dimension
    Activity: Using a format file to populate a target table
  Adding a new table to the target using template tables
    Using template tables
    Activity: Using template tables
  Lesson Summary
    Quiz: Creating a Batch Job

Lesson 5: Validating, Tracing, and Debugging Batch Jobs
  Understanding pushed-down operations
    Pushed-down operations
    Viewing SQL generated by a data flow
  Using descriptions and annotations
    Using descriptions with objects
    Using annotations to describe job, work, and data flows
  Validating and tracing jobs
    Validating jobs
    Setting job execution options
    Tracing jobs
    Using log files
    Examining monitor logs
    Examining statistics logs
    Examining error logs
    Displaying job logs in the Designer project area
    Determining success for job execution
    Activity: Setting traces and adding annotations in a job
  Using View Data and the Interactive Debugger
    Using View Data with sources and targets
    Using the Interactive Debugger
    Setting filters and breakpoints for a debug session
    Activity: Using the interactive debugger with filters and breakpoints
  Lesson Summary
    Quiz: Validating, Tracing, and Debugging Batch Jobs
  Workshop
    Reading multiple file formats

Lesson 6: Using Built-in Transforms and Nested Data
  Describing built-in transforms
    Built-in transforms
    Case transform
    Activity: Using the Case transform
    Merge transform
    Validation transform
    Activity: Using the Validation transform
    Date_Generation transform
  Understanding nested data
    Understanding hierarchical data representation
    Explaining what nested data is
    Importing data from XML documents
    Importing metadata from a DTD file
    Importing metadata from an XML schema
  Understanding operations on nested data
    Explaining uses of nested data and the Query transform
    FROM clause construction
    Unnesting data
    Activity: Populating the Material dimension from an XML file
    Using the XML_Pipeline transform in a data flow
    XML_Pipeline
    Activity: Defining the XML_Pipeline in a data flow
  Lesson Summary
    Quiz: Using Built-in Transforms and Nested Data

Lesson 7: Using Built-in Functions
  Defining built-in functions
    Explaining what a function is
    Differentiating between functions and transforms
    Listing the types of operations for functions
    Listing the types of functions
  Using functions in expressions
  Using built-in functions
    Using date and time functions and the date_generation transform to build a dimension table
    to_char
    to_date
    julian
    month
    quarter
    Date_Generation transform
    Activity: Populating the time dimension data flow
    Use lookup functions to look up status in a table
    lookup_ext
    Lookup_seq
    Use match pattern functions to compare input string patterns
    Activity: Using the Lookup_ext() Function
    Using database functions to return information on data sources
    db_type
    db_version
    db_database_name
    db_owner
    decode
    Activity: Using the db_name and db_type functions
    Activity: Using the decode function
  Lesson Summary
    Quiz: Using Built-in Functions

Lesson 8: Using Data Integrator Scripting Language and Variables
  Understanding variables
    Describing the variables and parameters window
    Local variables and parameters
    Variable values and the Smart Editor
    Expression and variable substitution
    Explaining differences between global and local variables
  Understanding global variables
    Creating global variables
    Viewing global variables
    Setting global variables
    Activity: Creating global variables
  Understanding Data Integrator scripting language
    Explaining language syntax
    Script usage
    Syntax for statements in scripts
    Syntax for column and table references in expressions
    Basic syntax rules
    Statements can include
    Operators
    Script examples
    Using strings and variables in Data Integrator scripting language
    Quotation Marks
    Escape characters
    Trailing blank
    NULLs, empty strings, and trailing blanks in sources, transforms, and functions
  Scripting a custom function
    Determining when to use custom functions
    Creating a custom function
  Lesson Summary
    Quiz: Using Data Integrator Scripting Language and Variables
  Workshop
    Evaluating and validating data

Lesson 9: Capturing Changes in Data
  Using source-based Changed Data Capture
    Explaining what Changed Data Capture (CDC) is
    Using Changed Data Capture (CDC) with time-stamped sources
    Initial and Delta loads
    Creating an initial load job
    Creating a delta load job
    Overlaps
  Using target-based CDC
    Table comparison
  Using history preserving transforms
    Explaining history preservation
    Generated keys
    Identifying history preserving transforms
    Table_Comparison transform
    History_Preserving transform
    Key_Generation transform
    Activity: Preserving history for changes in data with slowly changing dimensions
  Lesson Summary
    Quiz: Capturing Changes in Data

Lesson 10: Handling Errors and Auditing
  Understanding recovery mechanisms
    Listing levels of data recovery strategies
    Partially loaded jobs and automatic recovery
    Marking recovery units
    Recovery mode
    Partially loaded data
    Creating recoverable work flows
    Activity: Creating a recoverable work flow
    Using try/catch blocks to specify alternate work flow options
    Try/catch blocks and automatic recovery
    Processing data with problems
  Understanding auditing in data flows
    Defining audit points and rules
    Audit points
    Audit label names
    Audit rules
    Defining audit actions on failure
    Viewing audit results
    Resolving invalid audit labels
    Explaining guidelines for choosing audit points
    Activity: Using auditing in a data flow
  Lesson Summary
    Quiz: Handling Errors and Auditing

Lesson 11: Supporting a Multi-user Environment
  Working in a multi-user environment
    Explaining the Data Integrator development process
    Designing
    Testing
    Production
    Describing terminology used in a multi-user environment
    Explaining repository types in a multi-user environment
  Setting up a multi-user environment
    Creating a central repository
    Defining a connection to a central repository
    Activating a central repository
    Activity: Creating a central repository
  Describing common tasks
    Adding objects
    Activity: Importing and adding objects in the central repository
    Adding objects with filtering to the central repository
    Checking out objects
    Undoing check outs
    Checking in objects
    Activity: Checking in and out objects from the central repository
    Getting objects
    Labeling objects
    Comparing objects
    Toolbar
    Navigation bar
    Status bar
    Viewing object history
    Deleting objects
  Lesson Summary
    Quiz: Supporting a Multi-user Environment

Lesson 12: Migrating Projects
  Understanding migration tools
    Preparing for migration
    External data sources
    Directory locations
    Schema structures and owners
    Describing migration mechanisms and tools
    Choosing a migration mechanism
  Using datastore configurations and migration
    Creating multiple configurations in a datastore
    Configurations toolbar
    Using the Rename Owner tool to rename database objects
    Using the Rename Owner tool in a multi-user environment
    Creating a system configuration
  Migrating a multi-user environment
    Distinguishing between a phased and a versioned migration
    Adding a project to a central repository
    Getting the latest version of a project
    Updating a project
    Copying content from central repositories
  Migrating a single-user environment
    Importing and exporting objects to a repository
    Exporting objects to a file
    Exporting and importing a repository to a file
  Lesson Summary
    Quiz: Migrating Projects

Lesson 13: Using the Administrator
  Using the Administrator
    Logging into the Administrator
    Describing the Administrator interface
  Managing the Administrator
    Viewing available repositories
    Adding a repository
    Adding user roles
  Managing batch jobs with the Administrator
    Setting the status interval for job execution
    Setting the job log retention period
    Executing jobs
    Scheduling jobs
    Scheduling with the Administrator
    Monitoring jobs
    Activity: Scheduling and executing jobs in a third-party tool
  Understanding server group basics
    Architecture, load balance index, and job execution
    Architecture
    Job execution
    Working with server group and Designer options
    Adding a server group
    Editing and removing server groups
    Viewing Job Server status in server groups
  Understanding central repository security basics
    Explaining central repository security
    Creating a secure central repository
    Defining connections to a secure central repository
    Implementing group permissions
    Viewing and modifying permissions
  Lesson Summary
    Quiz: Using the Administrator

Lesson 14: Profiling Data
  Using the Data Profiler
    Explaining what data profiling is
    Setting up a Data Profiler repository
    Adding Data Profiler users
    Configuring Data Profiler tasks
    Connecting the Designer to the Data Profiler server
    Activity: Setting up a Data Profiler repository
    Submitting a profiling task
    Monitoring Data Profiler tasks in the Administrator
    Activity: Submitting a column profiler task
    Activity: Submitting a relationship profiler task
  Lesson Summary
    Quiz: Profiling Data

Lesson 15: Managing Metadata
  Using metadata reporting
    Explaining what metadata is
    Support for metadata exchange
    Importing and exporting metadata using the Metadata Exchange feature
    Using reporting tables and views to analyze metadata
  Using the metadata reporting tool
    Accessing Data Integrator Metadata Reports
    Viewing Impact and Lineage Analysis metadata reports
    Impact analysis tab
    Lineage analysis tab
    Activity: Using Impact and Lineage Analysis
    Viewing operational dashboard metadata reports
    Job execution statistics
    Job execution duration
    Changing dashboard display settings
    Using Auto Documentation metadata reports
    Viewing object information
    Generating Auto Documentation for an object
    Changing auto documentation display settings
    Activity: Creating Auto Documentation
  Lesson Summary
    Quiz: Managing Metadata
  Workshop
    Building MySales warehouse

Appendix A: Answer Key
  Quiz: Data Warehousing Concepts
  Quiz: Understanding Data Integrator
  Quiz: Defining Source and Target Metadata
  Quiz: Creating a Batch Job
  Quiz: Validating, Tracing, and Debugging Batch Jobs
  Quiz: Using Built-in Transforms and Nested Data
  Quiz: Using Built-in Functions
  Quiz: Using Data Integrator Scripting Language and Variables
  Quiz: Capturing Changes in Data
  Quiz: Handling Errors and Auditing
  Quiz: Supporting a Multi-user Environment
  Quiz: Migrating Projects
  Quiz: Using the Administrator
  Quiz: Profiling Data
  Quiz: Managing Metadata


AGENDA

Data Integrator XI R1/R2: Extracting, Transforming, and Loading Data

Lesson 1: Data Warehousing Concepts (30 minutes)
  Understanding Data Warehousing Concepts

Lesson 2: Understanding Data Integrator (1 hour)
  Understanding Data Integrator benefits and functions
  Understanding Data Integrator architecture
  Understanding Data Integrator objects
  Using the Data Integrator Designer interface

Lesson 3: Defining Source and Target Metadata (1 hour)
  Using datastores
  Importing metadata
  Defining a file format

Lesson 4: Creating a Batch Job (1.5 hours)
  Creating a batch job
  Creating a simple data flow
  Using the Query transform
  Understanding the Target Table editor
  Executing the job
  Adding a new table to the target using template tables

Lesson 5: Validating, Tracing, and Debugging Batch Jobs (1.5 hours)
  Understanding pushed-down operations
  Using descriptions and annotations
  Validating and tracing jobs
  Using View Data and the Interactive Debugger
  Workshop

Lesson 6: Using Built-in Transforms and Nested Data (2 hours)
  Describing built-in transforms
  Understanding nested data
  Understanding operations on nested data

Lesson 7: Using Built-in Functions (2 hours)
  Defining built-in functions
  Using functions in expressions
  Using built-in functions

Lesson 8: Using Data Integrator Scripting Language and Variables (1 hour)
  Understanding variables
  Understanding global variables
  Understanding Data Integrator scripting language
  Scripting a custom function
  Workshop

Lesson 9: Capturing Changes in Data (2 hours)
  Using source-based Changed Data Capture
  Using target-based CDC
  Using history preserving transforms

Lesson 10: Handling Errors and Auditing (2 hours)
  Understanding recovery mechanisms
  Understanding auditing in data flows

Lesson 11: Supporting a Multi-user Environment (1 hour)
  Working in a multi-user environment
  Setting up a multi-user environment
  Describing common tasks

Lesson 12: Migrating Projects (1 hour)
  Understanding migration tools
  Using datastore configurations and migration
  Migrating a multi-user environment
  Migrating a single-user environment

Lesson 13: Using the Administrator (1 hour)
  Using the Administrator
  Managing the Administrator
  Managing batch jobs with the Administrator
  Understanding server group basics
  Understanding central repository security basics

Lesson 14: Profiling Data (1 hour)
  Using the Data Profiler

Lesson 15: Managing Metadata (1 hour)
  Using metadata reporting
  Using the metadata reporting tool
  Workshop

About this Course

This section explains the conventions used in the course and in this training guide.


Course objectives

BusinessObjects™ Data Integrator XI R1/R2: Extracting, Transforming, and Loading Data is a classroom-based course where participants learn to design efficient ETL (extraction, transformation, and loading) projects with Data Integrator to enable users to generate concise information from multiple data sources. The course includes presentation of concepts, demonstration of features, facilitated discussions, practice activities, and reviews.

After completing this course, you will be able to:
• Describe BusinessObjects Data Integrator architecture
• Define source and target metadata to import into your ETL projects
• Create, validate, trace, and debug batch jobs
• Use nested data, built-in Transforms and Functions to support data flow movement requirements
• Use global variables and Data Integrator Scripting Language
• Capture changes in data and handle errors and exceptions
• Support a multi-user environment, administer batch jobs, and migrate projects using datastore and system configurations
• Profile data
• Manage metadata

Course audience

This course is designed for individuals responsible for implementing projects that involve extracting, transforming, and loading data in batch jobs, and for administering and managing projects that involve Data Integrator.

Prerequisite education
• Not applicable in this offering


Prerequisite knowledge/experience

To be successful, learners who attend this course must have working knowledge of:
• Data warehousing and ETL concepts
• Microsoft SQL Server
• SQL language
• Functions, elementary procedural programming, and flow-of-control statements, for example: If, Else, While, Loop

It is also recommended that you review these articles prior to attending the course:

http://www.rkimball.com/html/articles.html
• Articles: Data Warehouse Fundamentals
  • TCO Starts with the End User
  • Fact Tables and Dimension Tables
• Articles: Data Warehouse Architecture and Modeling
  • There Are No Guarantees
• Articles: Architecture/Modeling: Advanced Dimension Topics
  • Surrogate Keys
  • It's Time for Time
  • Slowly Changing Dimensions
• Articles: Architecture/Modeling: Industry- and Application-Specific Issues
  • Think Globally, Act Locally
• Articles: Data Staging and Data Quality
  • Dealing with Dirty Data

Course success factors

Your learning experience will be enhanced by:
• Activities that build on the life experiences of the learner
• Discussion that connects the training to real working environments
• Learners and instructor working as a team
• Active participation by all learners


Course materials

The materials included with this course are:
• Name card
• Learner's Guide
  The Learner's Guide contains an agenda, learner materials, and practice activities. It is designed to assist students who attend the classroom-based course and outlines what learners can expect to achieve by participating in this course.
• Evaluation form
  At the conclusion of this course, provide feedback on the course content, instructor, and facility through the evaluation process. Your comments will assist us in improving future courses.

Additional information for the course is provided on the resource CD or as a hard copy:
• Sample files
  The sample files on the resource CD can include required files for the course activities and/or supplemental content to the training guide.

Additional resources include:
• Online Help and User's Guide
  Retrieve information and find answers to questions using the Online Help and/or User's Guide included with the product.

Learning process

Learning is an interactive process between the learners and the instructor. By facilitating a cooperative environment, the instructor guides the learners through the learning framework.

What's specific to BusinessObjects XI Release 2?

The audience for this course may consist of BusinessObjects XI Release 1 and BusinessObjects XI Release 2 customers. This icon has been placed throughout the guide to identify features that are specific to BusinessObjects XI Release 2.


Introduction: Why am I here? What's in it for me?
The learners will be clear about what they are getting out of each lesson.

Objectives: How do I achieve the outcome?
The learners will assimilate new concepts and how to apply the ideas presented in the lesson. This step sets the groundwork for practice.

Practice: How do I do it?
The learners will demonstrate their knowledge as well as their hands-on skills through the activities.

Review: How did I do?
The learners will have an opportunity to review what they have learned during the lesson. Review reinforces why it is important to learn particular concepts or skills.

Summary: Where have I been and where am I going?
The summary acts as a recap of the learning objectives and as a transition to the next section.

Applicable certification

Not Applicable


Computer Setup

This section lists the hardware and software setup requirements for trying the activities on your own.


Introduction

The purpose of this setup guide is to provide the information necessary to set up your computer and ensure the necessary course files are installed if you want to recreate the learning environment.


Computer Setup

Hardware

You can install Data Integrator components on one or more computers based on available resources and the amount of traffic the system processes.

The minimum hardware requirements are:
• Pentium processor with at least 128 MB RAM and 100 MB free disk space
• 1 GB virtual memory
• Screen resolution of 1024 x 768 pixels with 16-bit or 65536 colors recommended (minimum 256 colors)
• Paging file with a minimum setting of 256 MB and a maximum setting of 512 MB
• Recommended for best performance: dual processors (minimum 500 MHz) with at least 512 MB physical memory

Software

The software required for this course is:
• An operating system
• Microsoft SQL Server as your server database software
• Data Integrator XI R1/R2

Install software according to the instructions in this document.


To install required software

1  Install one of the following operating systems:
   • NT
   • 2000 Professional
   • 2000 Server
   • 2000 Advanced Server
   • 2000 Datacenter Server
   • XP
   • 2003
2  Install Microsoft SQL Server 2000 Enterprise or Personal edition as your database server software for the database serving as your Data Integrator repository if you are running NT, 2003, 2000 Server, Advanced, or Datacenter.
   Note: Install Microsoft SQL Server 2000 Personal if you are running Windows 2000 Professional or XP. Microsoft SQL Server 2000 Enterprise is not supported on the Windows 2000 Professional or XP operating systems.
   When you are prompted to specify an Authentication mode during the MS SQL Server installation, make sure you specify Mixed Mode so that both NT and SQL Server system accounts are recognized.
3  Create the required databases for installing Data Integrator XI R1/R2. See the section "To create the databases required for BusinessObjects Data Integrator XI R1/R2: Extracting, Transforming, and Loading Data" for instructions.
   Note: These instructions also include the creation of the databases used for completing the activities in this course.
4  Install Data Integrator XI R1/R2. See the installation procedures under "Installing Data Integrator on Windows — Running the installation program" in Chapter 6 of the Data Integrator Getting Started Guide.
   Note: When you follow the instructions to install Data Integrator XI R1/R2, make sure that you:
   • Save your license file to an appropriate directory, such as your desktop, for easy access during installation.
   • Create a local Data Integrator repository. Use the direpo database that you created when you are prompted to create the Data Integrator repository and configure the Data Integrator Job Server during the installation process.
   • Configure the Data Integrator Job Server.


To create the databases required for BusinessObjects Data Integrator XI R1/R2: Extracting, Transforming, and Loading Data
1 Click Start, point to Programs, then point to Microsoft SQL Server, and click Enterprise Manager.
2 Under the Console Root folder, expand the folders and navigate to your local Windows machine.
   Under your local machine you will see many folders, including the Databases and Security folders.
3 Use the attributes in the table below to create a Data Integrator repository, source and target databases, and user names.
   • Right-click Databases to create your repository databases.
   • Expand Security and right-click Login to create your user names.
   All the users should have the following SQL Server Login properties:
   • In the General tab, under Authentication, select SQL Server Authentication. No password is needed for the purpose of this course.
   • In the Server Roles tab, select System Administrator.
   • In the Database Access tab, under Permit, select the corresponding database each user name is to be associated to. For example, user name direpouser should be associated to the direpo database.
   • In the Database Access tab, under Permit in Database roles, select public and db_owner.
4 Do a brand new install of all the default Data Integrator XI R1/R2 components.

Database Type          Database Name   Username
Repository             direpo          direpouser
Source database        ODS             ODSuser
Target database        Target          targetuser
Demo Source database   Demo_ODS        Demo_ODSu
Demo Target database   Target_ODS      DemoTargetu
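The course uses Enterprise Manager for this step; purely for reference, a scripted sketch of what step 3 sets up for one database and login pair might look like the following T-SQL (the blank password matches the classroom setup only and the script is not required by the course):

-- repository database and its login; repeat for ODS/ODSuser, Target/targetuser,
-- Demo_ODS/Demo_ODSu, and Target_ODS/DemoTargetu
CREATE DATABASE direpo;
GO
EXEC sp_addlogin @loginame = 'direpouser', @passwd = '', @defdb = 'direpo';
EXEC sp_addsrvrolemember @loginame = 'direpouser', @rolename = 'sysadmin';
GO
USE direpo;
GO
EXEC sp_grantdbaccess @loginame = 'direpouser';
EXEC sp_addrolemember @rolename = 'db_owner', @membername = 'direpouser';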


Testing the hardware and software setup
To ensure the hardware and software are set up properly, perform these steps on your machine.

Data Integrator sets up a Windows service for the Job Server. You should have two services: the Data Integrator Service and the Data Integrator Web Server.

To verify Data Integrator services have been installed
1 Click Start, point to Settings, and click Control Panel. Double-click Administrative Tools, and then double-click Services.
2 Under Services, you should see the Data Integrator Service and the Data Integrator Web Server.
Tip: If you are missing the Data Integrator Service, you can install this service (al_jobservice.exe) manually from the command line prompt.
• Go to Start > Run and type cmd in the Open text box.
• In the CMD.exe window, go to the drive where you installed Data Integrator and run al_jobservice.exe. For example:
  C:\Program Files\Business Objects\Data Integrator 11.5\al_jobservice.exe -install

To verify that Job Servers are running
1 On your desktop, right-click the Task bar, and select Task Manager.
2 Click the Processes tab and check that the following processes are available:
• al_jobservice.exe (represents the Data Integrator service)
• al_jobserver.exe (one per Job Server)


Setting up for the activities

You must set up the sample database and files needed to complete the activities.
1 In your Data Integrator installation directory, using Windows Explorer, browse to \Tutorial Files\Scripts.
2 Right-click CreateTables_MSSQL.bat and select Edit.
   This .bat file opens in Notepad and displays two commands in this format:
   isql /e /n /U username /S servername /d databasename /P password /i ODS_MSSQL.sql /o Tutorial_MSSQL.out
3 For the first command, edit the following:
   • Replace username with ODSuser
   • Replace servername with localhost. If your machine name is different than localhost, use your machine's name.
   • Replace databasename with ODS
   • Delete password and leave it blank
   Your .bat file should look like this:
   isql /e /n /U ODSuser /S localhost /d ODS /P /i ODS_MSSQL.sql /o Tutorial_MSSQL.out
4 Repeat the same procedure, using the values below from the Target database that you created in MS SQL Server earlier, to edit the second command:
   • Replace username with targetuser
   • Replace servername with localhost. If your machine name is different than localhost, use your machine's name.
   • Replace databasename with Target
   • Delete password and leave it blank
5 Save the changes.
6 Go back to \Tutorial Files\Scripts and double-click CreateTables_MSSQL.bat to run the .bat file.
7 Go back to MS SQL Enterprise Manager and navigate to the Databases folder.
8 Expand the ODS and Target databases.

In addition to the system tables, the following tables should exist in your ODS source database after you run the script:

Descriptive Name        Table name in database
Customer                ODS_Customer
Employee                ODS_Employee
Material                ODS_Material
Region                  ODS_Region
Sales Delivery          ODS_Delivery
Sales Order Header      ODS_Salesorder
Sales Order Line Item   ODS_Salesitem

In addition to the system tables, the following tables should exist in your target database after you run the script:

Descriptive Name        Table name in database
CDC status              CDC_time
Customer Dimension      cust_dim
Employee Dimension      employee_dim
Material Dimension      mtrl_dim
Sales Fact              sales_fact
Sales Org Dimension     salesorg_dim
Recovery Status         status_table

To set up the demo database (optional):
1 Insert the Instructor Resource CD into your computer.
2 Navigate to your CD drive, browse to the Guided_Demo_Scripts folder, and open it.
3 Copy these files to the \Tutorial Files\Scripts folder in your Data Integrator installation directory:
   • CreateDEMOTables_MSSQL.bat
   • Demo_ODS_MSSQL.sql
   • Demo_Target_MSSQL.sql
4 After the files are copied, double-click CreateDEMOTables_MSSQL.bat.
   This creates the tables and fields required and also populates the DEMO_ODS database with data.
   Instructions for setting up the demo database are also available on the Instructor Resource CD in this file: README_For_Demo_DBs.txt


Getting Help
If you encounter difficulties installing the software when recreating the learning environment, refer to the Installation document found on the product CD or contact Business Objects Customer Support. For a current contact list, visit http://support.businessobjects.com


Lesson 1 Data Warehousing Concepts

To understand Data Integrator, you need a basic knowledge of Data Warehousing concepts. This lesson is a review of the main concepts that will better enable you to grasp the potential of Data Integrator.

In this lesson you will learn about:
• Understanding Data Warehousing Concepts

Duration: 30 minutes


Understanding Data Warehousing Concepts

Introduction

After completing this unit, you will be able to:
• Describe normal forms of data
• Explain dimensional modeling

Normal forms of data

Normalization
In creating a database, normalization is the process by which data is organized into tables such that results from the database are always unambiguous and as intended.

Normalization may have the effect of duplicating data within the database and often results in the creation of additional tables. Although normalization tends to increase the duplication of data, it does not introduce redundancy, which is unnecessary duplication. Normalization is typically a refinement process after you have identified the data objects that belong in the database, identified the relationships between those data objects, and defined the tables required and the columns within each table.

A simple example of normalizing data might consist of a table showing:

Customer   Item Purchased   Purchase Price
Thomas     Shirt            $40
Maria      Tennis shoes     $35
Evelyn     Shirt            $40
Pajaro     Trousers         $25

If this table is used for the purpose of keeping track of the price of items and you want to delete one of the customers, you also remove a price. Normalizing the data means understanding this and solving the problem by dividing this table into two tables: one with information about each customer and a product they bought, and a second with each product and its price. Making additions or deletions to either table does not affect the other.
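A sketch of that split in SQL follows; the table and column names here are illustrative only and are not part of the course databases:

-- one table holds each product and its price
CREATE TABLE Product (
    ItemName      varchar(50) PRIMARY KEY,
    PurchasePrice money
);
GO
-- the other holds each customer and the product they bought
CREATE TABLE CustomerPurchase (
    CustomerName varchar(50),
    ItemName     varchar(50) REFERENCES Product (ItemName)
);
GO
INSERT INTO Product VALUES ('Shirt', 40);
INSERT INTO Product VALUES ('Tennis shoes', 35);
INSERT INTO Product VALUES ('Trousers', 25);
INSERT INTO CustomerPurchase VALUES ('Thomas', 'Shirt');
INSERT INTO CustomerPurchase VALUES ('Maria', 'Tennis shoes');
INSERT INTO CustomerPurchase VALUES ('Evelyn', 'Shirt');
INSERT INTO CustomerPurchase VALUES ('Pajaro', 'Trousers');

-- deleting a customer no longer removes the price of the item they bought:
-- the Shirt price ($40) is still preserved in Product
DELETE FROM CustomerPurchase WHERE CustomerName = 'Thomas';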


The five forms of normal data

First normal form: All occurrences of a record type must have the same number of fields.
Second normal form: Each non-key field in a record should be information about what is uniquely identified by the key field.
Third normal form: Each non-key field in a record with a compound key should be information about the entity identified by the compound key.
Fourth normal form: Records should not have multiple independent one-to-many relationships.
Fifth normal form: Records should not have unnecessary redundancy.

Flattening hierarchies to normalize or denormalize data
Examples of principles for flattening hierarchies:


Horizontally flattened hierarchy
Flattening the hierarchy minimizes the number of lookups required for a high level report.

This is the traditional way of flattening a hierarchy, where each row represents a leaf node in the hierarchy. The advantage is that no additional lookups are required; the complete ancestry of a leaf node in the hierarchy is repeated for each row. This solves the performance problem inherent in reporting on normalized data, but leaves the hierarchy rigid.

Recursive hierarchy

In a recursive hierarchy, each record points back to its parent. This allows the structure of the hierarchy to flex without changing the table schema. Data is still fully normalized; multiple lookups are required for a SELECT statement.

Vertically flattened hierarchy

The data you need about the relationship between any two related nodes in the hierarchy is provided in a single row with no lookups required. Data Integrator gives you options to generate either a horizontally or vertically flattened hierarchy.
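A sketch of the three representations as SQL tables follows; the names and columns are illustrative only:

-- recursive hierarchy: each record points back to its parent; reading a full ancestry
-- requires repeated lookups or self-joins
CREATE TABLE region_recursive (
    node_id   int PRIMARY KEY,
    parent_id int NULL REFERENCES region_recursive (node_id),
    node_name varchar(50)
);

-- horizontally flattened: one row per leaf node with its complete ancestry repeated
-- in fixed columns, so no additional lookups are needed but the structure is rigid
CREATE TABLE region_horizontal (
    leaf_id     int PRIMARY KEY,
    level1_name varchar(50),
    level2_name varchar(50),
    level3_name varchar(50)
);

-- vertically flattened: one row for every (ancestor, descendant) pair, so the
-- relationship between any two related nodes is a single-row lookup
CREATE TABLE region_vertical (
    ancestor_id   int,
    descendant_id int,
    depth         int
);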


Dimensional modelling

Dimensional modelling is a logical data design technique that aims to represent data in an intuitive standard framework that allows for high performance access.

Components that make up dimension tables

Some examples are:

Fact table: A table containing business data generated and/or changed regularly through normal business transactions. Sometimes it is called a transaction table. Examples are a Sales table and an Orders table.
Dimension table: A table containing detailed descriptions of one aspect of a transaction, for example, the product dimension of a sales order.
Conformed dimensions: Dimensions designed to describe multiple fact tables. Conformed dimensions create a data warehouse bus architecture linked across fact tables.
Star schema: A set of one or more fact tables and all related dimension tables.
Slowly changing dimensions: The master data in a dimension table, while less volatile than transaction data in a fact table, still changes from time to time. For example, a customer changes his/her name (Peggy Sevenster becomes Peggy Dougal), or a product changes name (33oz Diet Coke becomes 33.5 oz Light Coke). Type 1, 2, and 3 slowly changing dimensions (SCDs) provide classic ways to handle such changes.
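As an illustration, the target tables built later in this course (cust_dim, mtrl_dim, sales_fact) form exactly this kind of star schema; a stripped-down sketch, with hypothetical columns, might look like:

-- two dimension tables describing aspects of a sale
CREATE TABLE cust_dim  (cust_id int PRIMARY KEY, cust_name varchar(50));
CREATE TABLE mtrl_dim  (mtrl_id int PRIMARY KEY, mtrl_desc varchar(50));
GO
-- the fact table holds the transactions and joins back to each dimension
CREATE TABLE sales_fact (
    cust_id     int REFERENCES cust_dim (cust_id),
    mtrl_id     int REFERENCES mtrl_dim (mtrl_id),
    sales_date  datetime,
    sale_amount money
);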


Types of Slowly Changing Dimensions (SCD)

Type 1 (no history preservation):
• Natural consequence of normalization
Type 3 (limited history preservation):
• Two states of data are preserved: current and old
• New fields are generated to store history data
• Requires an Effective_Date field
Type 2 (unlimited history preservation, new rows):
• New rows generated for significant changes
• Requires use of a unique key; the key relates to facts/time
• Optional EffectiveTo_Date field

SCD Type 1
For a Type 1 change, you find and update the appropriate attributes on a specific dimensional record. For example, to update a record in the SALES_PERSON_DIMENSION to show a change to an individual's SALES_PERSON_NAME field, you simply update one record in the SALES_PERSON_DIMENSION table. This action would update or correct that record for all fact records across time. In a dimensional model, facts have no meaning until you link them with their dimensions. If you change a dimensional attribute without appropriately accounting for the time dimension, the change becomes global across all fact records.

However, suppose a salesperson transfers to a new sales team. Updating the salesperson's dimensional record would update all previous facts so that the salesperson would appear to have always belonged to the new sales team. If you want to preserve an accurate history of who was on which sales team, a Type 1 change might work for the Sales Name field, but not for the Sales Team field. In this case, a Type 3 change is appropriate.

SCD Type 3
To solve the sales team problem, let's look at it as a Type 3 change (because of the complexity, we'll discuss Type 2 changes last in these examples). To implement the Type 3 change, you change the dimension structure slightly so that it adds two attributes, old and new Sales Team fields, and renames one attribute to record the date of the change.

A Type 3 implementation has three disadvantages:
• You can preserve only one change per attribute: old and new, or first and last.
• Each Type 3 change requires a minimum of one additional field per attribute, and two additional fields if you want to record the date of the change.
• Although the dimension's structure contains all the data needed, the SQL code required to extract the information can be complex. Extracting a specific value is not difficult, but if you want to obtain a value for a specific point in time or multiple attributes with separate old and new values, the SQL statements become long and have multiple conditions. Overall, Type 3 changes can store the data of a change, but they can neither accommodate multiple changes nor adequately serve the need for summary reporting.

The following example assumes today’s date is November 1 2004.
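A minimal sketch of how such a Type 3 record could be produced in SQL, assuming hypothetical field names OLD_SALES_TEAM and EFFECTIVE_DATE alongside the existing SALES_TEAM field, might look like:

-- add a field to hold the previous value and a field for the date of the change (names assumed)
ALTER TABLE SALES_PERSON_DIMENSION ADD OLD_SALES_TEAM varchar(30), EFFECTIVE_DATE datetime;
GO
-- record the team change: keep the old value, overwrite the current one, stamp the date
UPDATE SALES_PERSON_DIMENSION
SET    OLD_SALES_TEAM = SALES_TEAM,
       SALES_TEAM     = 'Southeast',
       EFFECTIVE_DATE = '2004-11-01'
WHERE  SALES_PERSON_ID = '000120';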

SCD Type 2
A Type 2 change answers the sales team dilemma. With a Type 2 change, you do not need to make structural changes to the SALES_PERSON_DIMENSION table, but you need to add a record. Suppose you have a record that looks like the following Sales Team Record:

SALES_PERSON_KEY   SALES_PERSON_ID   NAME          SALES_TEAM
15                 000120            Doe, John B   Northwest

After you implement the Type 2 change, two records appear, as in the following table:

SALES_PERSON_KEY   SALES_PERSON_ID   NAME          SALES_TEAM
15                 000120            Doe, John B   Northwest
133                000120            Doe, John B   Southeast

Each record is related to appropriate facts, which are related to specific points in time in the Time Dimension.
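For reference, a minimal sketch of the Type 2 change in SQL, next to the Type 1 overwrite it replaces; the table and values come from the example above, and the statements are illustrative rather than Data Integrator output:

-- Type 1: overwrite the attribute in place; all existing facts take on the new value
UPDATE SALES_PERSON_DIMENSION
SET    SALES_TEAM = 'Southeast'
WHERE  SALES_PERSON_ID = '000120';

-- Type 2: keep the existing row (key 15) and add a new row with a new surrogate key,
-- so facts loaded before the change still point to the Northwest row
INSERT INTO SALES_PERSON_DIMENSION (SALES_PERSON_KEY, SALES_PERSON_ID, NAME, SALES_TEAM)
VALUES (133, '000120', 'Doe, John B', 'Southeast');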


Lesson Summary

Review

Quiz: Data Warehousing Concepts
1 Explain the difference between a recursive hierarchy and a vertically flattened hierarchy.
2 What is dimensional modelling?
3 What is an SCD? List some of the advantages of Type 2 SCD. When might Type 1 be the best solution?

Summary

After completing this lesson, you are able to:
• Describe normal forms of data
• Explain dimensional modeling


Lesson 2 Understanding Data Integrator

As a Data Manager, part of your responsibilities may include identifying and integrating data from dissimilar data sources and outputting the data into a target data warehouse. Integrating the identified source data also requires transformation of the data itself in accordance to the business rules and requirements identified at your organization.

Data Integrator is a work space for staging the Extracting, Transforming, and Loading (ETL) of data from multiple sources into a common target database or data warehouse for analysis.

In this lesson you will learn about:
• Understanding Data Integrator benefits and functions
• Understanding Data Integrator architecture
• Understanding Data Integrator objects
• Using the Designer interface

Duration: 1 hour


Understanding Data Integrator benefits and functions

Introduction

Data Integrator facilitates the process of organizing data from dissimilar sources through a graphical interface into a common datastore for analysis.

Data Integrator also manages data access to data sources and delivers data to analytic, supply-chain management, customer relationship management, and web applications.

Data Integrator Extracts, Transforms, and Loads (ETL) data from heterogeneous sources into a single datastore using jobs that organize work flows and data flows in both real-time and batch use. Although Data Integrator can be used for both real-time and batch jobs, this course only covers ETL using batch jobs.

After completing this unit, you will be able to:
• List Data Integrator's benefits
• Describe key Data Integrator performance functions

Data Integrator benefits

The Business Objects Data Integration Platform enables you to develop enterprise data integration for batch uses. Your enterprise can:
• Create a single infrastructure for batch movement to enable faster and lower cost implementation
• Manage data as a corporate asset independent of any single system
• Integrate data across many systems and reuse that data for many purposes
• Improve performance
• Reduce burden on enterprise systems
• Pre-package data solutions for fast deployment and quick ROI


Single point of integration
Data Integrator combines both batch data movement and management with intelligent caching to provide a single data integration platform for information management from any information source and for any information use. This unique combination allows you to:
• Stage data in an operational datastore, data warehouse, or data mart
• Update staged data in batch or real-time modes
• Direct XML-based requests (queries or updates) to the staged data or directly to back-office systems. User-specified rules control request routing
• Create a single graphical development environment for developing, testing, and deploying the entire data integration platform
• Manage a single metadata repository to capture the relationships between different extraction and access methods and provide integrated lineage and impact analysis

Key Data Integrator performance functions

Data Integrator performs three key functions that can be combined to create a scalable, high-performance data platform.

The three key functions are:
• Loading data: Populate ERP or enterprise application data into an operational datastore (ODS) or analytical data warehouse, and update in batch-time
• Routing requests: Intelligently create information requests to a data warehouse or ERP system using complex rules
• Applying transactions: Apply transactions against ERP systems

Loading data
Data mapping and transformation can be specified using the Data Integrator Designer graphical user interface. Data Integrator automatically generates the appropriate interface calls to access the data in the source system.

For most ERP applications, Data Integrator generates SQL optimized for the specific target database (Oracle, DB2, SQL Server, Informix, and so on).

Automatically generated, optimized code reduces the cost of maintaining data warehouses and enables you to build data solutions quickly, meeting user requirements faster than other methods (for example, custom-coding, direct-connect calls, or PL/SQL).
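For illustration only (the statement below is hypothetical and is not the code Data Integrator actually generates), a load pushed down to a SQL Server target typically takes the form of set-based SQL rather than row-by-row custom code:

-- hypothetical mapping from the course's ODS_Customer source into the cust_dim target;
-- the column names are assumed for the sketch
INSERT INTO Target.dbo.cust_dim (cust_id, cust_name)
SELECT src.Cust_ID, src.Cust_Name
FROM   ODS.dbo.ODS_Customer AS src;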

Applying transactions
Data Integrator can apply data changes in a variety of data formats and any custom format using a Data Integrator adapter. Enterprise users can apply data changes against multiple back-office systems singularly or sequentially. By generating calls native to the system in question, Data Integrator makes it unnecessary to develop and maintain customized code to manage the process. You can design access intelligence into each transaction by adding flow logic that checks values in a data warehouse or in the transaction itself before running a transaction directly against the target ERP system.


Understanding Data Integrator architecture

Introduction

Data Integrator relies on several unique components to accomplish the ETL activities required to manage your corporate data platform.

After completing this unit, you will be able to:
• Describe standard Data Integrator components
• Describe Data Integrator management tools

Standard Data Integrator components

Data Integrator includes the following standard components:
• Designer
• Repository
• Service: This includes the Job Server and Access Servers
  Note: We do not discuss the Access Server service since Access Servers are only used in real-time jobs.
• Web Server: This includes the Administrator and the Metadata Reporting tool

This diagram illustrates the relationships between these components: the Designer and the repository, the Data Integrator Service with the Job Server and its engines, the Access Server with its real-time services, and the Web Server hosting the Administrator and Metadata Reporting.


Data Integrator Designer
The Designer is the development interface that allows you to create, test, and manually execute jobs that populate a data warehouse. Using the Designer, you create data management applications that consist of data mappings, transformations, and control logic.

You can create objects that represent data sources, then drag, drop, and configure them by selecting icons in flow diagrams, table layouts, and nested workspace pages.

The Designer interface also allows you to manage metadata stored in a Data Integrator repository. From the Designer, you can also trigger the Job Server to run your jobs for initial application testing.

Data Integrator repository
The Data Integrator repository is a set of tables that hold user-created and predefined system objects, source and target metadata, and transformation rules. It is set up on an open client/server platform to facilitate sharing metadata with other enterprise tools. Each repository is stored on an existing RDBMS.

There are two types of repositories:
• A local repository is used by an application designer to store definitions of Data Integrator objects (like projects, jobs, work flows, and data flows) and source/target metadata.
• A central repository is an optional component that can be used to support multi-user development. The central repository provides a shared object library allowing developers to check objects in and out of their local repositories.

Each repository is associated with one or more Data Integrator Job Servers. Job Servers run the jobs.

When you are designing a job, you can run it from the Designer. The Designer tells the Job Server to run the job. The Job Server then gets the job from its associated repository and starts an engine to process the job.

During production, the Job Server runs jobs triggered by a scheduler. In production environments, you can balance loads appropriately because you can link multiple repositories to a single Job Server.


Data Integrator Service
The Data Integrator Service is installed when Data Integrator Job and Access Servers are installed. The Data Integrator Service starts Job Servers when you restart your system. The Windows service name is Data Integrator Service.

Job Server and engine
You define the Data Integrator Job Server that executes the data movement jobs you specify. When these jobs are executed, the Job Server starts the Data Integrator data movement engine, which integrates data from multiple heterogeneous sources, performs complex data transformations, and manages extractions and transactions in the ETL process.

The Data Integrator engine processes use parallel pipelining and in-memory data transformations to deliver high data throughput and scalability.

Data Integrator Web Server
The Data Integrator Web Server supports browser access to the Administrator and the Metadata Reporting tool. The Windows service name for this server is Data Integrator Web Server. The UNIX equivalent is the AL_JobService. Both use a Tomcat servlet engine to support browser access.

Administrator
The Administrator provides browser-based administration of Data Integrator resources including:
• Scheduling, monitoring, and executing batch jobs
• Configuring, starting, and stopping real-time services
• Configuring Job Server, Access Server, and repository usage
• Configuring and managing adapters
• Managing users
• Publishing batch jobs and real-time services via web services


Metadata Reporting tool
This tool provides browser-based reports on Enterprise metadata stored in the Data Integrator repository. Reports include:
• Datastore Analysis: View overall datastore information, including table, function, overview, and hierarchy reports to determine:
  • Which data sources populate the datastore tables
  • Which target tables the source tables populate
• Operational Statistics: Investigate job and data flow execution. This category includes graphical reports.
• Auto Documentation Reports: Use these reports to conveniently generate printed documentation to capture an overview of an entire ETL project and critical information for the objects you create in Data Integrator.

The following options are also available with Data Integrator and BusinessObjects Enterprise:
• Datastore Analysis: Use the reports described to see whether one or more of the following Business Intelligence reports uses data from tables that are contained in:
  • Business Views
  • Crystal Reports
  • Universes
  • WebIntelligence documents
  • Desktop Intelligence documents

• Dependency Analysis: Search for specific objects in your repository and understand how those objects impact or are impacted by other Data Integrator or Business Objects universe objects and reports. Metadata search results provide links back into associated reports.

• Universe Analysis: View universe, class, and object lineage. Universe users can determine what data sources populate their universes and what reports use their universes.

• Business View Analysis: View the data sources for Business Views in the Central Management Server (CMS).

• Report Analysis: View data sources for reports in the Central Management Server (CMS). You can view table and column lineage reports for each Crystal Reports, Desktop Intelligence, and WebIntelligence Document managed by the CMS. Report writers can determine what data sources populate their reports.


Data Integrator management tools

There are several management tools that assist in managing your Data Integrator components: the Repository Manager and the Server Manager.

Repository Manager
The Repository Manager allows you to create, upgrade, and check the versions of local and central repositories.

Server Manager
The Server Manager allows you to add, delete, or edit the properties of Job Servers. It is automatically installed on each computer on which you install a Job Server.

Use the Server Manager to define links between Job Servers and repositories. You can link multiple Job Servers on different machines to a single repository (for load balancing) or each Job Server to multiple repositories (with one default) to support individual repositories (separating test from production, for example).


Understanding Data Integrator objects

Introduction

After completing this unit, you will be able to:
• Explain object characteristics
• Explain relationships between objects
• Describe the Data Integrator development process

Data Integrator objects

In Data Integrator, all entities you add, define, modify, or work with are objects. Some of the most frequently used objects are:
• Projects
• Jobs
• Work flows
• Data flows
• Scripts
• Transforms

This diagram shows some common objects.


All objects have options, properties, and classes. Each can be modified to change the behavior of the object.

Options: Options control the object. For example, to set up a connection to a database, defining the database name is an option for the connection.
Properties: Properties describe the object. For example, the name and creation date. Attributes are properties used to locate and organize objects.
Classes: Every object is one of two classes, either reusable or single-use. Reusable objects have a single definition and can be copied from the Designer Local object library.

An object's class determines how you create and retrieve the object. There are two classes of objects:
• Single-use objects
• Reusable objects

Single-use objects
Single-use objects appear only as components of other objects. They operate only in the context in which they were created.
Note: You cannot copy single-use objects.

Reusable objects
A reusable object has a single definition and all calls to the object refer to that definition. If you change the definition of the object in one place, and then save the object, the change is reflected to all other calls to the object.

Most objects created in Data Integrator are available for reuse. After you define and save a reusable object, Data Integrator stores the definition in the repository. You can then reuse the definition as often as necessary by creating calls to the definition.

For example, a data flow within a project is a reusable object. Multiple jobs, such as a weekly load job and a daily load job, can call the same data flow. If this data flow is changed, both jobs call the new version of the data flow.

You can edit reusable objects at any time independent of the current open project. For example, if you open a new project, you can open a data flow and edit it; but the changes you make to the data flow are not kept until you save them.

Saving objects
When you save a single-use or reusable object in Data Integrator, you are storing the language that describes the object to the repository.

The description of a single-use object can only be saved as part of the reusable object that calls the single-use object.


The description of a reusable object includes:
• Properties of the object
• Options for the object
• Calls this object makes to other objects
• Definition of single-use objects called by this object
Note: If an object contains a call to another reusable object, only the call to the second object is saved, not changes to that object's definition.

You can either use the Save option to save open objects in the Designer workspace or use the Save All option to save all objects that have changes in the repository.
Note: If a single-use object is open in the workspace, the Save command is not available.

The next unit in this lesson discusses the Data Integrator Designer in more detail.

Relationship between objects

Jobs are composed of work flows and/or data flows:
• A work flow is the incorporation of several data flows into a coherent flow of work for an entire job
• A data flow is the process by which source data is transformed into target data

A work flow orders data flows and operations that support them; a work flow also defines the interdependencies between data flows. For example, if one target table depends on values from other tables, use the work flow to specify the order in which you want Data Integrator to populate the tables. Also use work flows to define strategies for handling errors that occur during project execution. You can also use work flows to define conditions for running sections of a project.

This diagram illustrates a typical work flow.

A data flow defines the basic task that Data Integrator accomplishes, which involves moving data from one or more sources to one or more target tables or files. You define data flows by identifying the sources from which to extract data, the transformations the data should undergo, and targets.


Projects and jobs
A project is the highest-level object in the Data Integrator Designer window. Projects provide you with a way to organize the other objects you create in Data Integrator. Only one project is open at a time, where open means visible in the project area.

The single project open below is named Class_Exercises.

A job is the smallest unit of work that you can schedule independently for execution. A project is a reusable object that allows you to group jobs. You can use a project to group jobs that have schedules that depend on one another or that you want to monitor together.

Projects have common characteristics:
• Projects are listed in the Local object library
• Only one project can be open at a time
• Projects cannot be shared among multiple users

The objects in a project appear hierarchically in the project area. If a plus sign (+) appears next to an object, expand it to view the lower-level objects contained in the object. Data Integrator displays the contents as both names and icons in the project area hierarchy and in the workspace.
Note: Jobs must be associated with a project before they can be executed in the project area of the Designer.


Object hierarchy
In the repository, the designer groups objects hierarchically from a project, to jobs, to optional work flows, to data flows. In jobs, work flows define a sequence of processing steps, and data flows transform data from source(s) to target(s).

This illustration shows the hierarchical relationships for the key object types within Data Integrator.

Data Integrator object hierarchy


However, for the purpose of this course we focus on creating batch jobs using database datastores and file formats as shown in this diagram.

Focus of object hierarchy for this course


The Data Integrator development process

The development process you use to create your ETL application involves three distinct phases: design, test, and production.
• Design
  In the design phase, you define objects and build diagrams that instruct Data Integrator in your data movement requirements. Data Integrator stores these specifications so you can reuse them or modify them as your system evolves.
• Test
  In the testing phase, you use Data Integrator to test the execution of your application. At this point, you can test for errors and trace the flow of execution without exposing production data to any risk. If errors appear in this phase, return the application to the design phase for correction and test again.
• Production
  In the production phase, you set up a schedule in Data Integrator to run your application as a job. You can evaluate results from production runs and, when necessary, return to the design phase to optimize performance and refine your target requirements.

The accompanying diagram shows this as a cycle: design the job (define data movement requirements), test the job (without exposing production data), and the production phase (use the results to refine the job).


Using the Data Integrator Designer interface

Introduction

The Data Integrator Designer interface allows you to plan and organize your extraction, transformation, and loading applications in a visual way.

Most of the components of Data Integrator can be programmed through this interface.

After completing this unit, you will be able to:
• Explain how the Designer is used
• Describe key areas in the Data Integrator Designer window

Designer interface

You use Data Integrator to design, produce, and run data movement applications. Using the Designer, you create batch jobs to contain, organize, and execute work flows and data flows that specify data extraction, transformation, and loading processes.

You can set applications to run in test mode or on a specific schedule.

Key areas of the Designer window
The Data Integrator Designer interface consists of a single application window and several embedded supporting windows. The application window contains the Menu bar, Toolbar, Local object library, Project area, Tool palette, and Workspace.

Tip: You can access the Data Integrator Technical Manuals for reference or help through the Designer interface Help menu. These manuals are also accessible by going to Start > Programs > BusinessObjects Data Integrator 11.5 > Data Integrator Documentation > Technical Manuals.


Toolbar
In addition to many of the standard Windows toolbar buttons, Data Integrator provides unique toolbar buttons, including:

Close all windows: Closes all open windows in the workspace.
Local object library: Opens and closes the Local object library window.
Central object library: Opens and closes the Central object library window.
Variables: Opens and closes the variables and parameters creation window.
Project Area: Opens and closes the project area.
Output: Opens and closes the output window.
Diff View: Displays the differences between objects.
View Enabled Descriptions: Enables the system level setting for viewing object descriptions in the workspace.
Validate Current View: Validates the object definition open in the workspace. Other objects included in the definition are also validated.
Validate All Objects in View: Validates the object definition open in the workspace. Objects included in the definition are also validated.
Audit: Opens the Audit window. You can collect audit statistics on the data that flows out of any Data Integrator object.
View Where Used: Opens the Output window, which lists parent objects (such as jobs) of the object currently open in the workspace (such as a data flow).
Go Back: Moves back in the list of active workspace windows.
Go Forward: Moves forward in the list of active workspace windows.
Metadata Reports: Opens and closes the Metadata Reports window.
About: Opens the Data Integrator About box, with product component version numbers and a link to the Business Objects Web site.

Local object library
The Local object library gives you access to the object types listed below. Each entry shows the tab on which the object type appears in the Local object library, its class, and the Data Integrator context in which you can use each type of object.

Projects (Single-use): Projects are sets of jobs available at a given time.
Jobs (Reusable): Jobs are executable work flows. There are two job types: batch jobs and real-time jobs.
Work flows (Reusable): Work flows order data flows and the operations that support data flows, defining the interdependencies between them.
Data flows (Reusable): Data flows describe how to process a task.
Transforms (Reusable): Transforms operate on data, producing output data sets from the sources you specify. The Local object library lists both built-in and custom transforms.
Datastores (Reusable): Datastores represent connections to databases and applications used in your project. Under each datastore is a list of the tables, documents, and functions imported into Data Integrator.
Formats (Reusable): Formats describe the structure of a flat file, XML file, or XML message.
Custom Functions (Reusable): Custom Functions are functions written in the Data Integrator Scripting Language. They can be used in Data Integrator jobs.


Project area
The project area provides a hierarchical view of the objects used in each project. Tabs on the bottom of the project area support different tasks. Tabs include:
• A tab to create, view, and manage projects, providing a hierarchical view of all objects used in each project.
• A tab to view the status of currently executing jobs. Selecting a specific job execution displays its status, including which steps are complete and which steps are executing. These tasks can also be done using the Data Integrator Administrator.
• A tab to view the history of complete jobs. Logs can also be viewed with the Data Integrator Administrator.


To change the docked position of the project area
1 Right-click the gray border of the project area.
2 Select Allow Docking.
3 Click and drag the project area to dock at and undock from any edge within the Designer window. When you drag the project area away from a Designer window edge, it stays undocked.
4 To quickly switch between your last docked and undocked locations, just double-click the gray border.

To change the undocked position of the project area
1 Right-click the gray border of the project area.
2 Click Allow Docking to remove the checkmark and deselect docking.
3 Click and drag the project area to any location on your screen and it will not dock inside the Designer window.

To hide/show the project area
1 Right-click the gray border of the project area.
2 Select Hide from the menu.
The project area disappears from the Designer window. To show the project area, click its toolbar icon.

This is an example of the Project window’s Designer tab, which shows the project hierarchy:

As you drill down into objects in the Designer workspace, the window highlights your location within the project hierarchy.


Tool palette
The tool palette is a separate window that appears by default on the right edge of the Designer workspace. You can move the tool palette anywhere on your screen or dock it on any edge of the Designer window.

The icons in the tool palette allow you to create new objects in the workspace. The icons are disabled when they are not allowed to be added to the diagram open in the workspace.

To show the name of each icon, hold the cursor over the icon until the tool tip for the icon appears.

When you create an object from the tool palette, you are creating a new definition of an object. If a new object is reusable, it will be automatically available in the Local object library after you create it.

For example, if you select the data flow icon from the tool palette and define a new data flow called DF1, you can later drag that existing data flow from the Local object library and add it to another data flow called DF2.

The tool palette contains these objects:

Pointer: Returns the tool pointer to a selection pointer for selecting and moving objects in a diagram. Available everywhere.
Work flow: Creates a new work flow. Reusable object. Available in jobs and work flows.
Data flow: Creates a new data flow. Reusable object. Available in jobs and work flows.
R/3 data flow: Used only with the SAP licensed extension.
Query transform: Creates a template for a query. Use it to define column mappings and row selections. Single-use object. Available in data flows.
Template table: Creates a table for a target. Single-use object. Available in data flows.
Template XML: Creates an XML template. Single-use object. Available in data flows.
Data transport: Used only with the SAP licensed extension.
Script: Creates a new script object. Single-use object. Available in jobs and work flows.
Conditional: Creates a new conditional object. Single-use object. Available in jobs and work flows.
Try: Creates a new Try object that tries an alternate work flow if an error occurs in a Data Integrator job. Single-use object. Available in jobs and work flows.
Catch: Creates a new Catch object that catches errors in a Data Integrator job. Single-use object. Available in jobs and work flows.
While Loop: Repeats a sequence of steps in a work flow as long as a condition is true. Single-use object. Available in work flows.
Annotation: Creates an annotation. Single-use object. Available in jobs, work flows, and data flows.

Workspace
When you open or select a job or any flow within a job hierarchy, the workspace becomes active with your selection. The workspace provides a place to manipulate system objects and graphically assemble data movement processes.

These processes are represented by icons that you drag and drop into a workspace to create a workspace diagram. This diagram is a visual representation of an entire data movement application or some part of a data movement application.

You specify the flow of data through jobs and work flows by connecting objects in the workspace from left to right in the order you want the data to be moved.


Lesson Summary

Review

Quiz: Understanding Data Integrator
1 List two benefits of Data Integrator.
2 Which of these objects is Single-Use?
  • Project
  • Job
  • Work flow
  • Data flow
3 What is the difference between a repository and a datastore?
4 Place these objects in order by their hierarchy: Data Flows, Job, Project, and Work Flows.

Summary

After completing this lesson, you are now able to:
• List Data Integrator's benefits
• Describe key Data Integrator performance functions
• Describe standard Data Integrator components
• Describe Data Integrator management tools
• Explain object characteristics
• Explain relationships between objects
• Describe the Data Integrator development process
• Explain how the Designer is used
• Describe key areas in the Data Integrator Designer window


Lesson 3 Defining Source and Target Metadata

To define data movement requirements in Data Integrator, first you must import your source and target metadata.

In this lesson you will learn about:
• Using datastores
• Importing metadata
• Defining a file format

Duration: 1 hour


Using datastores

Introduction

Datastores represent connections between Data Integrator and databases or applications, directly or through adapters.

Datastore connections allow Data Integrator to access metadata from a database or application and read from or write to that database or application. Adapters provide access to a third party application’s data and metadata.

After completing this unit, you will be able to:
• Explain what a datastore is
• Create a database datastore
• Change a datastore definition

Explaining what a datastore is

A datastore provides a connection or multiple connections to data sources such as a database. Through the datastore connection, Data Integrator can import metadata from the data source, such as descriptions of fields. When you specify tables as sources or targets in a data flow, Data Integrator uses the datastore to determine how to read data from or load data to those tables. In addition, some transforms and functions require a datastore name to qualify the tables they access.

Data Integrator uses datastores to read data from source tables or load data to target tables. Each source or target must be defined individually and the datastore options available depend on which RDBMS or application is used for the datastore.

To define a datastore, you must have an adequate account access to the database or application hosting the desired data.

Data Integrator reads and writes data stored in flat files through flat file formats. It reads and writes data stored in XML documents through DTDs and XML Schemas.

The specific information that a datastore contains depends on the connection. When your database or application changes, you must make corresponding changes in the datastore information in Data Integrator.
Note: Data Integrator does not automatically detect structural changes to the datastore.

Database datastores can represent single or multiple connections with:
• IBM DB2, Informix, Microsoft SQL Server, Oracle, Sybase, and Teradata databases (using native connections)
• Other databases (through ODBC)
• A simple memory storage mechanism using a memory datastore
• IMS, VSAM, and various additional legacy systems using BusinessObjects Data Integrator Mainframe Interfaces such as Attunity and IBM Connectors


Metadata consists of:
• Database tables:
  • Table name
  • Column names
  • Column data types
  • Primary key columns
  • Table attributes
• RDBMS functions
• Application-specific data structures

There are three kinds of datastores: database datastores, application datastores, and adapter datastores.

Database datastores provide a simple way to import metadata directly from a broad variety of RDBMS.

Application datastores let users easily import metadata from most Enterprise Resource Planning (ERP) systems.

Adapter datastores can provide access to an application’s data and metadata or just metadata. For example, if the data source is SQL-compatible, the adapter might be designed to access metadata, while Data Integrator extracts data from or loads data directly to the application.

Depending on the adapter implementation, adapters can provide:• Application metadata browsing.• Application metadata importing into the Data Integrator repository.

For batch and real-time data movement between Data Integrator and applications, Business Objects offers an Adapter Software Development Kit (SDK) to develop your own custom adapters. Also, you can buy Data Integrator prepackaged adapters to access application metadata and data in any application. For more information on these products, refer to your product documentation.

For more information on adapter datastores, see “Adapter Datastores”, Chapter 5 in the Data Integrator Designer Guide.

Note: You can import metadata from BusinessObjects Enterprise using adapters. The adapter for BusinessObjects Enterprise is the Data Mart Accelerator for Crystal Reports.

See the documentation folder under Adapters located in your Data Integrator installation for more on the Data Mart Accelerator for Crystal Reports. For a default installation, see C: \Program Files\Business Objects\Data Integrator 11.5\adapters\crystaladapter.


Creating a database datastore

You need to create at least one datastore for each database or file system with which you are exchanging data.

When you create a datastore you can specify one datastore configuration at a time as the default datastore configuration. Data Integrator uses the default configuration to import metadata and execute jobs.

To create a datastore, you must have appropriate access privileges to the database or file system that the datastore describes. If you do not have access, ask your database administrator to create an account for you.

To create a database datastore
1 In the Datastores tab of the Local object library, right-click and select New.
2 Enter the name of the new datastore in the Datastore Name field.
   The name can contain any alphanumeric characters or underscores (_). It cannot contain spaces.
3 Select the Database type.
   The options available are: Attunity, DB2, IBM Connector, Informix, Memory, Microsoft SQL Server, ODBC, Oracle, Sybase ASE, Sybase IQ, and Teradata.
   The values you select for the datastore type and the database type in the Datastore Editor determine the options available when you create a database datastore. The minimum number of entries that you must make to create a datastore depends on the selections you make for the first two options.
4 Click OK.
Note: When you create a datastore, you can also create datastore configurations by clicking the Advanced button. Datastore configurations are useful when you work in a multi-user environment that requires you to use more than one datastore configuration to migrate between the development, test, and production phases of your Data Integrator project. This feature is covered in more detail later in Lesson 12.


You will also see an option to Enable CDC (Capture Changed Data). This option enables log-based capture of changed data on SQL Server databases and applies it to a target system. To capture changed data, Data Integrator interacts with SQL Replication Server. You learn about timestamp-based CDC later in this course; log-based CDC is more complex and is outside the scope of this course. For more information on using CDC with Microsoft SQL databases, see "Techniques for Capturing Data", Chapter 19 in the Data Integrator Designer Guide.

Like all Data Integrator objects, datastores are defined by both options and properties:

Options control the operation of objects. These include the database server name, database name, user name and password for the specific database.

For adapter and application datastores, the datastore editor allows you to edit all connection properties except the datastore name and datastore type. For database datastores, you can edit all connection properties except the datastore name, datastore type, database type, and database version.

Properties document the object. For example, the name of the datastore and the date on which it was created are datastore properties. Properties are merely descriptive of the object and do not affect its operation.

To change datastore options

1 In the Datastores tab of the Local object library, right-click the datastore name and select Edit. The datastore editor appears in the workspace.

2 Change the database server name, database name, username, and password options.

Note: These changes take effect immediately.

Changing a datastore definition

Properties               Description

General Tab              Contains the name and description (if available) of the datastore. The datastore name appears on the object in the Local object library and in the calls to the object. You cannot change the name of a datastore after creation.

Attributes Tab           Includes the date you created the datastore. This value cannot be changed.

Class Attribute Tab      Includes overall datastore information such as description and date created.


To view datastore properties

1 In the Datastores tab of the Local object library, right-click the datastore name and select Properties. The Properties window lists a datastore’s description, attributes, and class attributes.

2 Change the datastore properties, and click OK.


Importing metadata

After completing this unit, you will be able to:
• List types of metadata
• Capture metadata information from imported data
• Import metadata by browsing

You can import metadata into the Data Integrator repository for:
• Database tables
• RDBMS functions

Data Integrator stores the following table information:
• Table name
• Attributes
• Indexes
• Column names
• Descriptions
• Data types
• Primary keys

Imported table information

Data Integrator determines and stores a specific set of metadata information for tables. After importing metadata, you can edit column names, descriptions, and data types. The edits are propagated to all objects that call these objects.

Introduction

Listing types of metadata

Capturing metadata information from imported data

Metadata             Description

Table name           The name of the table as it appears in the database.

Table description    The description of the table.

Column name          The name of the table column.

Column description   The description of the column.

Column data type     The data type for each column. If a column is defined as an unsupported data type, Data Integrator converts the data type to one that is supported. In some cases, if Data Integrator cannot convert the data type, it ignores the column entirely.

Primary key column   The column(s) that comprise the primary key for the table. After a table has been added to a data flow diagram, these columns are indicated in the column list by a key icon next to the column name.

Table attribute      Information Data Integrator records about the table, such as the date created and date modified, if these values are available.

Owner name           The name of the table owner.

The table above lists the table metadata information that Data Integrator imports.


Imported stored function and procedure information

You can import stored procedures from DB2, MS SQL Server, Oracle, and Sybase databases. You can also import stored functions and packages from Oracle. You can use these functions and procedures in the extraction specifications you give Data Integrator.

Information that is imported for functions includes:
• Function parameters
• Return type
• Name, owner

Imported functions and procedures appear on the Datastores tab of the Local object library. Functions and procedures appear in the Function branch of each datastore tree.

You can configure imported functions and procedures through the function wizard and the smart editor in a category identified by the datastore name.



You can import metadata by name, by searching, and by browsing. We focus on importing metadata by browsing in this course. For more information on importing by searching and importing by name, see “Ways of importing metadata”, Chapter 5 in the Data Integrator Designer Guide.

To import metadata by browsing

Note: Functions cannot be imported by browsing.

1 In the Datastores tab of the Local object library, select the datastore you want to use.

2 Right-click, and select Open. The items available to import through the datastore appear in the workspace. In some environments, the tables are organized and displayed as a tree structure. If this is true, there is a plus sign (+) to the left of the name. Click the plus sign to navigate the structure.

3 Select the items for which you want to import metadata. For example, to import a table, you must select a table rather than a folder that contains tables.

4 Right-click, and select Import.

5 In the Local object library, go to the Datastores tab to display the list of imported objects.

Note: The workspace contains columns that indicate whether the table has already been imported into Data Integrator (Imported) and if the table schema has changed since it was imported (Changed). To verify whether the repository contains the most recent metadata for an object, right-click the object, and select Reconcile. If there are changes to the most recent metadata, you also need to re-import the metadata to pull the changes into the datastore definitions.

Importing metadata by browsing


Activity: Creating an operational datastore (ODS) and importing table metadata for the source table

Objective

In this activity you will:
• Create an ODS datastore to connect to your source database
• Import metadata for individual source tables using the datastore

Instructions

1 In your Local object library, click to access the Datastores tab.

2 Right-click the Datastores tab area, and select New.

3 In the Create New Datastore window, enter:

• Datastore name: ODS_DS
• Datastore type: Leave as database
• Database type: select Microsoft SQL Server from the drop-down list
• Database server name: localhost, or your own machine name if different
• Database name: ODS
• User name: ODSuser
• Password: No password required

4 Click OK.

5 Go back to the Datastores tab, and double-click ODS_DS. The names of all the tables in the database defined by the datastore are displayed.

6 Resize the metadata column inside the workspace, if necessary.

7 In the workspace, select each table listed below, and right-click each table to import the metadata:

Practice


• DBO.ODS_CUSTOMER
• DBO.ODS_DELIVERY
• DBO.ODS_EMPLOYEE
• DBO.ODS_MATERIAL
• DBO.ODS_REGION
• DBO.ODS_SALESITEM
• DBO.ODS_SALESORDER

Note: A value of Yes appears under the Imported column.

8 In the Local object library, expand ODS_DS, and expand Tables.

The imported metadata tables appear as shown:


Activity: Creating a target datastore and importing metadata for target tables

Objective

In this activity you will:
• Create a target datastore to connect to your target database
• Import metadata for target tables using the datastore

Instructions

1 In your Local object library Datastores tab, create a new datastore.

2 In the Create New Datastore window, enter:

• Datastore name: Target_DS
• Datastore type: Leave as database
• Database type: select Microsoft SQL Server from the drop-down list
• Database server name: localhost, or your own machine name if different
• Database name: Target
• User name: targetuser
• Password: No password required

3 In the Datastores tab, double-click Target_DS.

4 Right-click each of the tables listed to import the metadata for each table into the local repository:
• DBO.CDC_TIME
• DBO.CUST_DIM
• DBO.EMPLOYEE_DIM
• DBO.MTRL_DIM
• DBO.SALES_FACT
• DBO.SALESORG_DIM
• DBO.STATUS_TABLE
• DBO.TIME_DIM

Note: A value of Yes appears under the Imported column.

5 In the Local object library, expand Target_DS, and expand Tables.

Practice


The imported metadata tables appear as shown.


Defining a file format

After completing this unit, you will be able to:
• Explain what a file format is
• Create a new file format
• Handle errors in file formats

A file format is a set of properties describing the structure of a flat file (ASCII) or a metadata structure. A file format describes a specific file, whereas a file format template is a generic description that can be used for many data files.

Data Integrator can use data stored in files for data sources and targets. A file format defines a connection to a file. Therefore, you use a file format to connect Data Integrator to source or target data when the data is stored in a file rather than a database table. The object library stores file format templates that you use to define specific file formats as sources and targets in data flows.

When working with file formats, you must:
• Create a file format template that defines the structure for a file.
• Create a specific source or target file format in a data flow.

The source or target file format is based on a template and specifies connection information such as the file name.

File format objects can describe files in:
• Delimited format: delimiter characters such as commas or tabs separate each field; the maximum length is 1000 characters.
• Fixed width format: the column width is specified by the user. Leading or trailing blank padding is added where fields diverge from the specified width. The maximum length is 1000 characters.
• SAP R/3 format: used with the predefined Transport_Format or with a custom SAP R/3 format.

Introduction

Explaining what a file format is


Use the file format editor to set properties for file format templates and source and target file formats. Available properties vary by the mode of the file format editor:
• New mode: use this to create a new file format template.
• Edit mode: use this to edit an existing file format template.
• Source mode: use this to edit the file format of a particular source file.
• Target mode: use this to edit the file format of a particular target file.

The file format editor has three work areas:
• Properties-Values: use this to edit the values for file format properties. Expand and collapse the property groups by clicking the leading plus or minus.
• Column Attributes: use this to edit and define the columns or fields in the file. Field-specific formats override the default format set in the Properties-Values area.
• Data Preview: use this to view how the settings affect sample data.

The file format editor contains frames that let you resize the window and all the work areas. You can expand the file format editor to the full screen size.

The properties and appearance of the work areas vary with the format of the file.

Creating file formats


Property Values

The Property-Value work area in the file format editor lists file format properties that are not field-specific. These options are filtered by the mode you are using.

The Column Attributes work area in the file format editor contains properties about the fields in the file format.

The Group File Read can read multiple flat files with identical formats through a single file format. By substituting a wild card character or list of file names for the single file name, multiple files can be read.

Date Formats

At the field level, date formats can override the default date format set for the file. Use any combination of the codes listed in the Data Integrator Reference Guide.

To create a new file format

1 In the Local object library, select the Formats tab, right-click Flat Files, and select New.


The file format editor opens.

2 In Type, specify the file type:
• Delimited: select this file type if the file uses a character sequence to separate columns.
• Fixed width: select this file type if the file uses specified widths for each column.

Note: Data Integrator represents column sizes (field-size) in number of characters for all sources except fixed-width file formats, which it always represents in bytes. Consequently, if a fixed-width file format uses a multi-byte code page, then no data is displayed in the data preview section of the file format editor for its files. For more information about multi-byte support, see Chapter 9, “Locales and Multi-Byte Functionality,” in the Data Integrator Reference Guide.

3 In the Name field, type a name that describes this file format template. After you save this file format template, you cannot change the name.

Note: In the Custom Transfer Program field, select YES if you want to read and load files using a third-party file-transfer program.

4 Complete the other properties to describe files that this template represents.

Note: Properties vary by file type. Look for properties available when the file format editor is in new mode.

5 For source files, specify the structure of the columns in the Column Attributes work area:
• Enter field name.
• Set data types.
• Enter field lengths for VarChar data types.
• Enter scale and precision information for Numeric and Decimal data types.


• Enter Format field information for appropriate data types, if desired. This information overrides the default format set in the Properties-Values area for that data type.

Note: You do not need to specify columns for files used as targets. If you do specify columns and they do not match the output schema from the preceding transform, Data Integrator writes to the target file using the transform’s output schema. For a decimal or real data type, if you only specify a source column format, and the column names and data types in the target schema do not match those in the source schema, Data Integrator cannot use the source column format specified. Instead, it defaults to the format used by the code page on the computer where the Job Server is installed.

6 Click Save & Close to save the file format and close the file format editor.

Tip: To make sure your file format definition works properly, it is important to finish inputting the values for the file properties before moving on to the Column Attributes work area.

To create a file format from an existing file format

1 In the Formats tab of the object library, right-click an existing file format and select Replicate. The File Format Editor opens, displaying the schema of the copied file format.

2 Under General, in the Name field, type a new, unique name for the replicated file format.

Note: You must enter a new name for the replicated file. Data Integrator does not allow you to save the replicated file with the same name as the original (or any other existing File Format object). Also, this is your only opportunity to modify the Name property value. After it is saved, you cannot modify the name again.

3 Edit other properties as desired.

4 Look for properties available when the file format editor is in new mode.

5 Click Save.

Tip: To terminate the replication process (even after you have changed the name and clicked Save), click Cancel or press the Esc button on your keyboard.


To read multiple flat files with identical formats through a single file format

You can read multiple flat files with identical formats by using a wildcard to refer to the names of the files that you want to read.

To use the wildcard, you first have to define the file format based on one single file that shares the same schema as the other files, and then edit the defined format to reference the wildcard.

1 Open the format wizard.

2 In the location field of the format wizard, enter one of the following:
• Root directory (optional, to avoid retyping)
• List of file names, separated by commas
• File name containing a wildcard character (*)

Note: When you use the wildcard (*) to refer to several files, Data Integrator reads one file, closes it, and then proceeds to read the next one. For example, if you specify the file name revenue*.txt, Data Integrator reads all flat files whose names begin with revenue and end with .txt.

To create a specific source or target file using a file format template

1 Select a flat file format template on the Formats tab of the object library.

2 Drag the file format template to the data flow workspace.

3 Select Make Source to define a source file format, or select Make Target to define a target file format.

4 Click the name of the file format object in the workspace to open the file format editor.

5 Enter the properties specific to the source or target file.

For a description of available properties, see “Data Integrator Objects”, Chapter 2 in the Data Integrator Reference Guide. Look for properties available when the file format editor is in source mode or target mode.

6 Under File(s), specify the file name and location in the File and Location properties.

7 Connect the file format object to other objects in the data flow as appropriate.


When you enable error handling in the File Format Editor, Data Integrator:
• Checks for the two types of flat-file source error:

• Data-type conversion errors: for example, a field might be defined in the File Format Editor as having a data type of integer but the data encountered is actually varchar.

• Row-format errors: for example, in the case of a fixed-width file, Data Integrator identifies a row that does not match the expected width value.

• Stops processing the source file after reaching a specified number of invalid rows

• Logs data-type conversion or row-format warnings to the Data Integrator error log. You can limit the number of warnings to log without stopping the job.

The error file is a semicolon-delimited text file that is written to a .txt file that you create and that resides on the same computer as the Job Server.

Tip: You can create this file in your Windows temp directory and call it file_errors.txt.

Entries in an error file have this syntax:

source file path and name; row number in source file; Data Integrator error; column number where the error occurred; all columns from the invalid row

This is an example of a row format error:d:/acl_work/in_test.txt;2;-80104: 1-3-A column delimiter was seen after column number <3> for row number <2> in file <d:/acl_work/in_test.txt>. The total number of columns defined is <3>, so a row delimiter should be seen after column number <3>. Please check the file for bad data, or redefine the input schema for the file by editing the file format in the UI.;3;defg;234;def

To enable flat file error handling in the File Format Editor

1 In the object library, click the Formats tab.

2 Expand Flat Files, and right-click to open the format file.

3 Under the Error handling section, click the drop-down arrow beside these options to enable each error handling feature:
• Capture data conversion errors
• Capture row format errors
• Write error rows to file

Handling errors in file formats


Note: You can also specify the maximum warnings to log and the maximum errors before a job is stopped.

4 In the Error file root directory field, click the folder icon to browse to the directory in which you have stored the error handling .txt file you created.

5 In the Error file name field, type the name for the .txt file that you created to capture the flat file error logs.


Activity: Creating a file format

Objectives

In this activity you will:
• Create a file format that will be used as a source to load data from a flat file in a later activity
• Define this file format based on identified business requirements, such as Date column attributes

Instructions

1 In the object library, on the Formats tab, right-click Flat Files and select New. The File format editor displays.

2 Under General, leave Type as Delimited and change the Name to Format_SalesOrg.

3 Under Data Files, in the File Name(s), click the folder icon and navigate in your Data Integrator install directory to \Tutorial Files.

4 Select Sales_org.txt.

5 Click Open.

6 Click Yes to overwrite the current schema.

The Column Attributes and Data Preview area of the File format editor should populate with the sales_org.txt source data values.

7 Under Input/Output, change the Skip row header field value to Yes.

8 Click Yes to overwrite the current schema again. This indicates that the first row contains the column headings. Data Integrator uses these values for the field names.

9 Under Default Format, in the Date field, change the format to ddmmyyyy. The source data contains dates with a 2-digit day number followed by a 2-digit month number, a 4-digit year (ddmmyyyy), and no time value.

10 For the DateOpen field name, click Data Type in the Column Attributes area, and change the data type to Date. The field types and lengths should look like this:

Field Name      Data Type    Field Size

SalesOffice     Int
Region          VarChar      2
DateOpen        Date
Country         VarChar      7

11 Click Save and Close.


Lesson Summary

Quiz: Defining Source and Target Metadata

1 What is the difference between a datastore and a database?

2 List five kinds of metadata.

3 What are the two methods in which metadata can be manipulated in Data Integrator objects? What does each of these do?

4 Which of the following is NOT a datastore type?
• Database
• File Format
• Application
• Adapter

After completing this lesson, you are now able to:
• Explain what a datastore is
• Create a database datastore
• Change a datastore definition
• List types of metadata
• Capture metadata information from imported data
• Import metadata by browsing
• Explain what a file format is
• Create a new file format
• Handle errors in file formats

Review

Summary


Lesson 4
Creating a Batch Job

Thus far, we have created datastores for importing metadata. Now that the metadata has been imported, we can create a simple data flow for defining data movement requirements. Before creating the data flow, we must create a project that contains a job.

In this lesson you will learn about:
• Creating a batch job
• Creating a simple data flow
• Using the Query transform
• Understanding the Target Table editor
• Executing the job
• Adding a new table to the target using template tables

Duration: 1.5 hours


Creating a batch job

After completing this unit, you will be able to:
• Create a project
• Create a job
• Add, connect, and delete objects in the workspace
• Create a work flow

A project is a reusable object that allows you to group jobs. A project is the highest level of organization offered by Data Integrator. Opening a project makes one group of objects easily accessible in the user interface.

You can use a project to group jobs that have schedules that depend on one another or that you want to monitor together.

Projects have common characteristics:
• Projects are listed in the object library
• Only one project can be open at a time
• Projects cannot be shared among multiple users

Objects that make up a project

The objects in a project appear hierarchically in the project area. If a plus sign (+) appears next to an object, expand it to view the lower-level objects contained in the object. Data Integrator shows you the contents as both names in the project area hierarchy and icons in the workspace.

Projects are made up of jobs, work flows (optional), and data flows.

In this example, the Job_KeyGen job contains two data flows, and the DF_EmpMap data flow contains multiple objects.

Introduction

Creating a project


Each item selected in the project area also displays in the workspace:

To create a new project

1 From the main toolbar, select Project > New > Project.

2 Enter the name of your new project. The name can include alphanumeric characters and underscores (_). It cannot contain blank spaces.

3 Click Create. The new project appears in the project area. As you add jobs and other lower-level objects to the project, they also appear in the project area.

To open an existing project

1 From the main toolbar, select Project > Open. Select the name of an existing project from the list.

2 Click Open.

Note: If another project is already open, Data Integrator closes that project and opens the new one.

To save a project

1 From the main toolbar, select Project > Save All. Data Integrator lists the jobs, work flows, and data flows that you edited since the last save.

Tip: Deselect any listed object to avoid saving it.

2 Click OK.

Note: You are prompted to save all changes made in a job when you execute the job or exit the Designer. Saving a reusable object saves any single-use object included in it.


A job is the only object you can execute. You can manually execute and test jobs in development. In production, you can schedule batch jobs and set up real-time jobs as services that execute a process when Data Integrator receives a message request.

A job is made up of steps that are executed together. Each step is represented by an object icon that you place in the workspace to create a job diagram. A job diagram is made up of two or more objects connected together. You can include any of the following objects in a job definition:
• Data flows
  • Sources
  • Targets
  • Transforms
• Work flows
  • Scripts
  • Conditionals
  • While loops
  • Try/catch blocks

If a job becomes complex, organize its content into individual work flows, and then create a single job that calls those work flows.

Naming conventions for objects in jobs

It is recommended that you follow consistent naming conventions to facilitate object identification across all systems in your enterprise. This allows you to work more easily with metadata across all applications, such as:
• Data-modeling applications
• ETL applications
• Reporting applications
• Adapter software development kits

Creating a job


To create a job in the project area

1 In the project area, select the project name.

2 Right-click and select New Batch Job.

3 Edit the name. The name can include alphanumeric characters and underscores (_). It cannot contain blank spaces. Data Integrator opens a new workspace for you to define the job.

Note: You can also create a job and related objects (work flows and data flows) from the Local object library. When you create a job in the Local object library, you must associate the job and all related objects with a project before you can execute the job. Job execution is covered in more detail later in this lesson.

After creating a job, you can add objects to the job workspace area using either the Designer Local object library or the Tool palette.

To add objects from the Local object library to the workspace

1 From the Local object library, click the tab for the type of object you want to add.

2 Click and drag the selected object into the workspace.

To add objects from the Tool palette to the workspace

• From the Tool palette, click the desired object, and click the workspace again to add it.

To connect objects in the workspace area

1 Place the objects you want to connect in the workspace.

2 Click and drag from the triangle or square of an object to the triangle or square of the next object in the flow to connect the objects.

To disconnect objects in the workspace area

1 Select the connecting line between the objects you want to disconnect.

2 On your keyboard, press Delete to disconnect the objects.

Adding, connecting, and deleting objects in the workspace


A work flow is an optional object that defines the decision-making process for executing data flows (a work flow may contain a data flow). For example, elements in a work flow can determine the path of execution based on a value set by a previous job or can indicate an alternative path if something goes wrong in the primary path. Ultimately, the purpose of a work flow is to prepare for executing data flows and to set the state of the system after the data flows are complete.

Jobs are special work flows: jobs are the only work flows you can execute. Almost all of the features documented for work flows also apply to jobs, with the exception that jobs do not have parameters; jobs do, however, support global variables, which provide some of the functionality of parameters. Global variables are discussed later in Lesson 9 “Using Data Integrator Scripting Language”.

Steps in a work flow

Work flow steps take the form of icons that you place in the workspace to create a work flow diagram. These objects can be elements in work flows:
• Work flows
• Data flows
• Conditionals
• While loops
• Try/catch blocks
• Scripts

Work flows can call other work flows, and you can nest calls to any depth. A work flow can also call itself. The connections you make between the icons in the workspace determine the order in which work flows execute, unless the jobs containing those work flows execute in parallel.

Creating a work flow


Order of execution in work flows

Steps in a work flow execute in a sequence from left to right, as indicated by the lines connecting the steps. Here is the diagram for a work flow that calls three data flows:

Note that DataFlow1 has no connection from the left but is connected on the right to the left edge of DataFlow2 and that DataFlow2 is connected to DataFlow3. There is a single thread of control connecting all three steps. Execution begins with DataFlow1 and continues through the three data flows.

Connect steps in a work flow when there is a dependency between the steps. If there is no dependency, the steps need not be connected. In that case, Data Integrator can execute the independent steps in the work flow as separate processes. In the following work flow, Data Integrator executes data flows 1 through 3 in parallel:

To execute more complex work flows in parallel, define each sequence as a separate work flow, and then call each of the work flows from another work flow as in this example:

Define Work flow A

Define Work flow B

Call work flows A and B from work flow C

You can specify a job to execute a particular work flow or data flow only one time. In that case, Data Integrator only executes the first occurrence of the work flow or data flow; Data Integrator skips subsequent occurrences in the job. You might use this feature when developing complex jobs with multiple paths, such as jobs with try/catch blocks or conditionals, and you want to ensure that Data Integrator only executes a particular work flow or data flow one time.


Conditionals

Conditionals are single-use objects used to implement if/then/else logic in a work flow. Conditionals and their components (If expressions, Then and Else diagrams) are included in the scope of the parent control flow’s variables and parameters.

When you define a conditional, you must specify a condition and two logical branches:

If       A Boolean expression that evaluates to TRUE or FALSE. You can use functions, variables, and standard operators to construct the expression.

Then     Work flow elements to execute if the If expression evaluates to TRUE.

Else     (Optional) Work flow elements to execute if the If expression evaluates to FALSE.

A conditional can fit in a work flow.

For example, you are using a Windows command file to transfer data from a legacy system into Data Integrator. You write a script in a work flow to run the command file and return a success flag. You then define a conditional that reads the success flag to determine if the data is available for the rest of the work flow.

To implement this conditional in Data Integrator, you define two work flows, one for each branch of the conditional. If the elements in each branch are simple, you can define them in the conditional editor.

The definition of the conditional shows these two branches:

Both the Then and Else branches of the conditional can contain any object that you can have in a work flow, including other work flows, nested conditionals, try/catch blocks, and so on.

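For example, the conditional in the command-file scenario above might be defined roughly as follows. This is a sketch only: the variable and object names are invented for illustration, and the flag variable would need to be declared for the job.

If:    ($G_TRANSFER_OK = 1)
Then:  the work flow that loads the transferred data
Else:  a script that notifies an administrator that the data is not available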


While loops

The while loop is a single-use object that you can use in a work flow. While loops repeat a sequence of steps in a work flow as long as a condition is true.

Typically, the steps done during the while loop result in a change in the condition so that the condition is eventually no longer satisfied and the work flow exits from the while loop. If the condition does not change, the while loop does not end.
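As a hedged sketch (the variable name is invented and must be declared before the loop runs), a while loop that repeats its steps at most ten times could use a condition such as the one below, with a script step inside the loop that changes the variable so the loop can eventually end:

While loop condition:   ($G_PASS_COUNT < 10)

Script step inside the loop (must eventually change the condition):
$G_PASS_COUNT = $G_PASS_COUNT + 1;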

Try/catch blocks

Try and catch objects are single-use objects. A try/catch block is a combination of one try object and one or more catch objects that allow you to specify alternative work flows if errors occur while Data Integrator is executing a job. Try/catch blocks:
• Catch classes of exceptions thrown by Data Integrator, the DBMS, or the operating system
• Apply solutions that you provide
• Continue execution

The general method to implement exception handling is:
• Insert a try object before the steps for which you are handling errors
• Insert one or more catches in the work flow after the steps

In each catch, do this:
• Indicate the error or group of errors that you want to catch
• Define the work flows that a thrown exception executes

If an exception is thrown during the execution of a try/catch block, and if no catch is looking for that exception, then the exception is handled by normal error logic.

This work flow shows a try/catch block surrounding a data flow:

In this case, if the data flow BuildTable causes any system-generated exceptions handled by the catch, then the work flow defined in the catch executes.

The action initiated by the catch can be simple or complex. Some examples of possible exception actions are:
• Send a prepared e-mail message to a system administrator (see the sketch below)
• Rerun a failed work flow or data flow
• Run a scaled-down version of a failed work flow or data flow
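A minimal sketch of the first action, written as a catch script, is shown below. It assumes the mail_to function described in the Data Integrator Reference Guide; the recipient address, message text, and parameter values are invented for this example and should be adapted to your environment and verified against the Reference Guide.

# Catch script (illustrative only): record the failure and notify an administrator
print('The BuildTable data flow raised an exception; notifying the administrator.');
mail_to('dw_admin@example.com', 'Data Integrator job failure', 'BuildTable failed. See the job error log for details.', 20, 20);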


Scripts

Scripts are single-use objects used to call functions and assign values to variables in a work flow.

For example, you can use the SQL function in a script to determine the most recent update time for a table and then assign that value to a variable. You can then assign the variable to a parameter that passes into a data flow and identifies the rows to extract from a source.
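A minimal sketch of such a statement follows. The datastore name, table, and column are assumptions borrowed from the course environment and may differ in your repository; check the sql function in the Data Integrator Reference Guide before using it.

# Sketch only: capture the most recent update time in a variable that is
# later passed to the data flow as a parameter
$G_LAST_UPDATE = sql('Target_DS', 'SELECT MAX(LAST_UPDATE_DATE) FROM DBO.STATUS_TABLE');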

A script can contain these statements:
• Function calls
• If statements
• While statements
• Assignment statements
• Operators

The basic rules for the syntax of the script are:
• Each line ends with a semicolon (;).
• Variable names start with a dollar sign ($).
• String values are enclosed in single quotation marks (').
• Comments start with a pound sign (#).
• Function calls always specify parameters, even if the function uses no parameters.

For example, the following script statement determines today’s date and assigns the value to the variable $TODAY:

$TODAY = sysdate();

Note: You cannot use variables unless you declare them in the work flow that calls the script.
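The short sketch below shows these rules together. The variable names are invented for illustration and would first be declared in the calling work flow:

# Comments start with a pound sign
$G_START_TIME = sysdate();                       # a function call assigned to a variable
$G_REGION = 'WEST';                              # string values use single quotation marks
print('Load started for region ' || $G_REGION);  # each statement ends with a semicolon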

Example of a work flow

Suppose you want to update a fact table. You define a data flow in which the actual data transformation takes place. However, before you move data from the source, you want to determine when the fact table was last updated so that you only extract rows that have been added or changed since that date.

You need to write a script to determine when the last update was made. You can then pass this date to the data flow as a parameter.

In addition, you want to check that the data connections required to build the fact table are active when data is read from them. To do this in Data Integrator, you define a try/catch block. If the connections are not active, the catch runs a script you wrote, which automatically sends mail notifying an administrator of the problem.

Scripts and error detection cannot execute in the data flow. Rather, they are steps of a decision-making process that influences the data flow. This decision-making process is defined as a work flow as in the example below. Data Integrator executes these steps in the order that you connect them.


To create a new work flow using the tool palette

1 Select the work flow icon in the tool palette.

2 Click where you want to place the work flow in the job diagram.

3 Add the data flows, work flows, conditionals, try/catch blocks, and scripts that you need.

Note: If more than one instance of a work flow appears in a job, you can improve execution performance by running the work flow only one time.

Tips for creating jobs, work flows and data flows

When you create jobs, work flows, and data flows, it is important that you:
• Use consistent naming conventions between jobs, work flows, and data flows, such as Name_Job or Job_Name.
• Define the job flow using a step-by-step approach: create a job first, add a work flow to the job, then add the data flow last.
• Name work flows in order of sequence. For example: 1_LoadDim_WF > 2_LoadFact_WF > 3_Load_Aggr_WF.


Creating a simple data flow

After completing this unit, you will be able to:
• Create a data flow
• Explain source and target objects
• Add source and target objects to a data flow

Data flows extract, transform, and load data. Everything having to do with data, including reading sources, transforming data, and loading targets, occurs inside a data flow. The lines connecting objects in a data flow represent the flow of data through data transformation steps.

After you define a data flow, you can add it to a job or work flow. When you drag a data flow icon into a job, you are telling Data Integrator to validate these objects according to the requirements of the job type. From inside a work flow, a data flow can send and receive information to and from other objects through input and output parameters.

Naming data flows

Data flow names can include alphanumeric characters and underscores (_). They cannot contain blank spaces.

Steps in a data flow

Each icon you place in the data flow diagram becomes a step in the data flow. The objects that you can use as steps in a data flow are:
• Source and target objects
• Transforms

The connections you make between the icons determine the order in which Data Integrator completes the steps.

Introduction

Creating a data flow


Data flows as steps in work flows

Data flows are closed operations, even when they are steps in a work flow. Any data set created within a data flow is not available to other steps in the work flow.

A work flow does not operate on data sets and cannot provide more data to a data flow; however, a work flow can:
• Call data flows to perform data movement operations
• Define the conditions appropriate to run data flows
• Pass parameters to and from data flows

Intermediate data sets in a data flow

Each step in a data flow, up to the target definition, produces an intermediate result, for example, the results of a SQL statement containing a WHERE clause, which flows to the next step in the data flow. The intermediate result consists of a set of rows from the previous operation and the schema in which the rows are arranged. This result is called a data set. This data set may, in turn, be further filtered and directed into yet another data set.

Data flow example

You want to populate the fact table in your data warehouse with new data from two tables in your source transaction database.

Your data flow consists of:
• Two source tables
• A join between these tables, defined in a query transform
• A target table where the new rows are placed

You indicate the flow of data through these components by connecting them in the order that data moves through them.


To create a new data flow

1 Open a job or work flow in the workspace in which you want to define a data flow.

2 Select the data flow icon in the tool palette.

3 Click the workspace to add the data flow to the job or work flow.

4 Double-click the data flow icon in the workspace and type a name for your data flow.

5 In the data flow workspace, add the sources, transforms, and targets.

A data flow directly reads and loads data using two types of objects:
• Source objects: define sources from which to read data
• Target objects: define targets to which to write (or load) data

Source objects

Source objects represent the data sources that data flows read from.

Note: Some transforms also act as source objects. Transforms are discussed briefly in the next section and in depth in Lesson 6 “Using Built-in Transforms” of this course.

Explaining source and target objects

Source object Description

Table A file formatted with columns and rows as used in relational databases.

Template table A template table that has been created and saved in another data flow (used in development).

File A delimited or fixed-width flat file.

Document A file with an application-specific format (not readable by SQL or XML parser).

XML file A file formatted with XML tags.

XML message Used as a source in real-time jobs.


Target objects

Target objects represent data targets that can be written to in data flows.

Prerequisites for using source or target objects in a data flow

Fulfill these prerequisites before using a source or target object in a data flow:

Target object Description

Table A file formatted with columns and rows as used in relational databases.

Template table A table whose format is based on the output of the preceding transform (used in development).

File A delimited or fixed-width flat file.

Document A file with an application-specific format (not readable by SQL or XML parser).

XML file A file formatted with XML tags.

XML template file An XML file whose format is based on the preceding transform output (used in development, primarily for debugging data flows).

For                                          Prerequisite

Tables accessed directly from a database     Define a custom datastore and import table metadata.

Template tables                              Define a custom datastore.

Files                                        Define a file format and import the file.

XML files and messages                       Import an XML file format.


To add a source or target object to a data flow

1 In the object library, open the data flow in which you want to place the object.

2 Select the appropriate object library tab:
• Formats tab for flat files, DTDs, or XML Schemas
• Datastores tab for database and adapter objects

3 Select the object you want to add as a source or target:
• For a new template table, select the Template Table icon from the tool palette
• For a new XML template file, select the Template XML icon from the tool palette

4 Drag the object and drop the object in the workspace.

Note: For objects that can be either sources or targets, when you release the cursor, a pop-up menu appears. Select the kind of object to make. The source or target object appears in the workspace.

5 Click the object name in the workspace. Data Integrator opens the editor for the object. Set the options you require for the object.

Adding source and target objects to a data flow


Using the Query transform

After completing this unit, you will be able to:
• Explain what a transform is
• Understand the Query transform
• Describe the Query editor window
• Use the Query transform in a data flow

Data Integrator includes objects called transforms. Transforms operate on data sets. Transforms manipulate input sets and produce one or more output sets.

Transforms are often used in combination to create the output data set. For example, the Table Comparison, History Preserving, and Key Generation transforms are often used together for slowly changing dimensions.

Sometimes transforms, such as the Date_Generation and SQL transform, can also be used as source objects.

These transforms are available from the Local object library on the Transforms tab. The following table provides a brief description of the available, built-in Data Integrator transforms. More detailed information on each transform is found in Lesson 6 “Using Built-in Transforms” of this course.

Introduction

Explaining what a transform is

Transform Description

Query Retrieves a data set that satisfies conditions that you specify. A query transform is similar to a SQL SELECT statement.

Case The Case transform simplifies branch logic in data flows by consolidating case or decision making logic in one transform. Paths are defined in an expression table.

Hierarchy_Flattening Flattens hierarchical data into relational tables so that it can participate in a star schema. Hierarchy flattening can be both vertical and horizontal.

Merge Unifies rows from two or more sources into a single target.

Pivot (Columns to Rows) Rotates the values in specified columns to rows. (Also see Reverse Pivot.)

Reverse Pivot (Rows to Columns) Rotates the values in specified rows to columns.

SQL Performs the indicated SQL query operation.

Row_Generation Generates a column filled with int values starting at zero and incrementing by one to the end value you specify.



Date_Generation Generates a column filled with date values based on the start and end dates and increment that you provide.

Map_Operation Allows conversions between operation codes.

Map_CDC_Operation While commonly used to support Oracle or mainframe changed-data capture, this transform supports any data stream if its input requirements are met. Sorts input data, maps output data, and resolves before- and after-images for UPDATE rows.

Effective_Date Generates an additional effective-to column based on the primary key's effective date.

Validation Transform Qualifies a data set based on rules for input schema columns. It filters out or replaces data that fails your criteria. The available outputs are pass and fail. You can have one validation rule per column.

Table_Comparison Compares two data sets and produces the difference between them as a data set with rows flagged as INSERT and UPDATE.

History_Preserving Converts rows flagged as UPDATE to UPDATE plus INSERT, so that the original values are preserved in the target. You specify the column in which to look for updated data.

Key_Generation Generates new keys for source data, starting from a value based on existing keys in the table you specify.

XML_Pipeline Processes large XML inputs in small batches.

Address_Enhancement Uses Address Correction and Encoding (ACE) to break an address into component parts that you can correct, complete, and standardize. Note: This transform is only available if you have Firstlogic installed.

Match_Merge Finds and organizes matching, duplicate, and unique database records. Note: This transform is only available if you have Firstlogic installed.

Name_Parsing Organizes inconsistent name data into a clean, uniform format. Note: This transform is only available if you have Firstlogic installed.

Data Integrator also maintains operation codes that describe the status of each row in each data set described by the inputs to and outputs from objects in data flows. You can use operation codes with transforms to indicate how each row in the data set is applied to a target table.


This section provides an overview of the Query transform, the most commonly used transform.

The Query transform retrieves a data set that satisfies conditions that you specify. This transform is similar to a SQL SELECT statement.

The Query transform can perform the following operations:
• Choose (filter) the data to extract from sources
• Join data from multiple sources
• Map columns from input to output schemas
• Perform transformations and functions on the data
• Perform data nesting and unnesting
• Add new columns, nested schemas, and function results to the output schema
• Assign primary keys to output columns

The next section gives a brief description of the function, data input requirements, options, and data output results for the Query transform. For more information on the Query transform, see “Transforms”, Chapter 5 in the Data Integrator Reference Guide.

Data Input

The data input is a data set from one or more sources with rows flagged with a Normal operation code.

Operation codes describe the status of each row in each data set described by the inputs to and outputs from objects in data flows. You can use operation codes with transforms to indicate how each row in the data set is applied to a target table.

The NORMAL operation code creates a new row in the target. All the rows in a data set are flagged as NORMAL when they are extracted from a source table or file. If a row is flagged as NORMAL when loaded into a target table or file, it is inserted as a new row in the target.

Options

The Query transform offers several options:
• Input schema: the input schema area displays all schemas input to the Query transform as a hierarchical tree. Each input schema can contain zero or more of these elements:
  • Columns
  • Nested schemas

Icons preceding columns are combinations of these graphics:

Understanding the Query transform


• Output schema: the output schema area displays the schema output from the Query transform as a hierarchical tree. The output schema can contain one or more of these elements:
  • Columns
  • Nested schemas
  • Functions

Icons preceding columns are combinations of these graphics:

Data Output

The data output is a data set based on the conditions you specify and using the schema specified in the output schema area.


The query editor, a graphical interface for performing query operations, contains these areas:
• Input schema area
• Output schema area
• Parameters area

The i icon indicates tabs containing user-defined entries.

The input and output schema areas can contain:
• Columns
• Nested schemas
• Functions (output only)

The Schema In and Schema Out lists display the currently selected schema in each area. The currently selected output schema is called the current schema and determines:
• The output elements that can be modified (added, mapped, or deleted)
• The scope of the Select through Order by tabs in the parameters area

The current schema is highlighted while all other (non-current) output schemas are gray.

Describing the Query editor window


The Options area includes the:
• Mapping tab
• Select tab
• From tab
• Outer Join tab
• Group By tab
• Order By tab
• Search/Replace tab

Use these tabs to specify the data you want retrieved. Specifying information on these tabs is similar to specifying a SQL SELECT statement.

To add a Query transform to a data flow

1 Double-click the data flow icon to open the data flow workspace.

2 From the Transforms tab in the Local object library, drag the Query transform onto the data flow workspace.

3 Connect the Query transform to your source and target tables in the workspace.

4 Double-click the Query transform to open the Query Editor.

5 Drag the items from the Schema In to the Schema Out area to connect the Query to inputs and outputs:
• The inputs for a Query can include the output from another transform or the output from a source.
• The outputs from a Query can include input to another transform or input to a target.

Note: If you connect a target table to a Query with an empty output schema, Data Integrator automatically fills the Query’s output schema with the columns from the target table, without mappings.

Using the Query transform in a data flow


Understanding the Target Table editor

The target table loader offers different tabs where you can set database type properties, set different table loading options, and use different tuning techniques for loading a job.

The performance tuning techniques offered in Data Integrator are more advanced and outside the scope of this course. For the purpose of this course, we focus on using the table loader to set database type properties and different table loading options. For more information on more advanced tuning techniques for the table loader, see “Tuning Techniques” in Chapter 3 of the Data Integrator Performance Optimization Guide.

After completing this unit, you will be able to:
• Set the database properties of the target table in the Target tab
• Describe different table loading options

The Database type option in the Target table Editor allows you to select a database type to set the properties for your target table.

To access the Target Table Editor

1 In a data flow, double-click the target table. The Target Table Editor opens and displays the Target, Options, Bulk Loader Options, Load Triggers, Pre-Load, and Post-Load Commands tabs.

The Target tab is populated from the Datastore Editor with datastore configurations. Datastore configurations are covered in Lesson 12 “Migrating Projects between Design, Test, and Production Phases” of this course.

Different database types and their version numbers are listed. Note: You can add or remove the datastore configurations in the Datastore

Editor to add or remove items from this list.

Introduction

Setting the database properties in the Target tab


The content of the individual tabs on the Target Table Editor changes to reflect the specific properties for that database type when you select a different database type.

The Database type option also allows new database versions to inherit target table settings. The first time you open the Target Table Editor for a specific instance of a table, current values for that database type appear.

If you define a new datastore configuration with a different database version, then that version's initial settings are inherited from the earlier version. If there is no earlier version, then Data Integrator uses the default settings for that database version. The goal of this behavior is to make it easier to modify your jobs to take advantage of later versions of supported database systems.

Options tab
You can set different table loading options in the Target Table Editor Options tab. Available options depend on the database in which the table is defined.

The following table loading options are available for all databases in Data Integrator.

Describing table loading options

Option Description

Rows per commit
This sets the number of rows sent to the target database in one fetch process.

This is considered to be a more advanced option. See the Data Integrator Performance Optimization Guide and “Description of objects” in the Data Integrator Reference Guide for more information.

Column comparison
This specifies how the input columns are mapped to output columns. There are two options:

compare_by_position — Data Integrator disregards the column names and maps source columns to target columns by position.

compare_by_name — Data Integrator maps source columns to target columns by name.
Validation errors occur if the data types of the columns do not match.
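As a rough SQL analogy (ORDERS_SRC and ORDERS_TGT are hypothetical tables, not part of the course data), the two settings behave like these two INSERT statements:

    -- compare_by_position: column names are ignored; the first source column
    -- loads the first target column, the second loads the second, and so on
    INSERT INTO ORDERS_TGT
    SELECT *
    FROM   ORDERS_SRC;

    -- compare_by_name: columns are paired by name, so the target list is written
    -- out explicitly even if the two tables declare the columns in a different order
    INSERT INTO ORDERS_TGT (ORDER_ID, ORDER_DATE, AMOUNT)
    SELECT ORDER_ID, ORDER_DATE, AMOUNT
    FROM   ORDERS_SRC;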

Delete data from table before loading
Used for batch jobs, sends a TRUNCATE statement to clear the contents of the table before loading. Defaults to not selected.
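With this option selected, the load behaves roughly like the sequence below. CUST_DIM and ODS_CUSTOMER are the course tables; the INSERT is only a simplified stand-in for the rows the data flow actually loads.

    -- issued before the load when Delete data from table before loading is selected
    TRUNCATE TABLE CUST_DIM;

    -- the data flow then loads the freshly emptied table
    INSERT INTO CUST_DIM (CUST_ID, CUSTCLASS_F, NAME1, ADDRESS, CITY, REGION_ID, ZIP)
    SELECT CUST_ID, CUSTCLASS_F, NAME1, ADDRESS, CITY, REGION_ID, ZIP
    FROM   ODS_CUSTOMER;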


Enable Partitioning
(Is displayed only if the target table data is partitioned.)

This loads data using the number of partitions in the table as the maximum number of parallel instances. You can only select one of the following loader options: Number of loaders, Enable partitioning, or transactional loading.
Note: If you select both Enable Partitioning and Include in transaction, the Include in transaction setting overrides the Enable partitioning option. For example, if your job is designed to load to a partitioned table, but you select Include in transaction and enter a value for Transaction order, when the job is executed, Data Integrator includes the table in a transaction load and does not parallel load to the partitioned table.

This is considered to be a more advanced option. See the Data Integrator Performance Optimization Guide and the “Description of objects” in the Data Integrator Reference Guide for more information.

Number of loaders
Loading with one loader is known as single loader loading.

The term parallel loading refers to loading jobs that contain a number of loaders greater than one. The default number of loaders for this option is one. The maximum number of loaders is five.

When parallel loading, each loader receives the number of rows indicated in the Rows per commit option in turn, and applies the rows in parallel with the other loaders.

For example, if you choose a Rows per commit of 1000 and set the number of loaders to three, the first 1000 rows are sent to the first loader. The second 1000 rows are sent to the second loader, the third 1000 rows to the third loader, and the next 1000 rows back to the first loader.

This is considered to be a more advanced option. See the Data Integrator Performance Optimization Guide and “Description of objects” in the Data Integrator Reference Guide for more information.


Use overflow file
This option is used for recovery purposes. If a row cannot be loaded, it is written to a file. When this option is selected, options are enabled for the file name and file format. The overflow format can include the data rejected and the operation being performed (write_data) or the SQL command used to produce the rejected operation (write_sql).
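For example, if a row destined for the course CUST_DIM table is rejected and the overflow format is write_sql, the overflow file holds the SQL for the failed operation, something like the hypothetical entry below (the exact layout depends on your database and job design):

    -- hypothetical contents of an overflow file written in write_sql format:
    -- the statement the target database rejected, ready to be corrected and re-run
    INSERT INTO CUST_DIM (CUST_ID, NAME1, CITY, REGION_ID)
    VALUES ('C1001', 'Sample Customer', 'Toronto', 'R01');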

Ignore columns with value
Enter a value that might appear in a source column and that you do not want updated in the target table. When this value appears in the source column, the corresponding target column is not updated during auto correct loading. You can enter spaces.
Note: Use this feature with Auto correct load to avoid overwriting fields in the target with nulls or other specified values.

Used with the Auto-correct load option. This is considered to be a more advanced option. See the Data Integrator Performance Optimization Guide and “Description of objects” in the Data Integrator Reference Guide for more information.

Ignore columns with null
Select this check box if you do not want NULL source columns updated in the target table during auto correct loading.
This option is only available when you select the Auto correct load check box.
Note: Use this feature with Auto correct load to avoid overwriting fields in the target with nulls or other specified values.

Used with the Auto-correct load option. This is considered to be a more advanced option. See the Data Integrator Performance Optimization Guide and “Description of objects” in the Data Integrator Reference Guide for more information.


Use input keys
If the target table contains no primary key, this option enables Data Integrator to use the primary keys from the input.

If the Use input keys check box is selected, Data Integrator uses the primary key of the source table. Otherwise, Data Integrator uses the primary key of the target table; if the target table has no primary key, Data Integrator considers the primary key to be all the columns in the target.

This is considered to be a more advanced option. See the Data Integrator Performance Optimization Guide and “Description of objects” in the Data Integrator Reference Guide for more information.

Update key columns
If you select this option, Data Integrator updates key column values when it loads data to the target.

Auto correct load
Ensures that the same row is not duplicated in a target table. This is particularly useful for data recovery operations.

When Auto correct load is selected, Data Integrator reads a row from the source and checks if a row exists in the target table with the same values in the primary key.

If a matching row does not exist, it inserts the new row regardless of other options. If a matching row exists, it updates the row depending on the values of Ignore columns with value and Ignore columns with null.
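This check-then-insert-or-update logic is essentially an upsert. A minimal sketch of the idea in standard SQL, assuming the course CUST_DIM target with CUST_ID as its primary key (Data Integrator performs the comparison itself and also honors the Ignore columns with value and Ignore columns with null settings, so this MERGE is an analogy rather than the statement it issues):

    -- upsert analogy for Auto correct load
    MERGE INTO CUST_DIM tgt
    USING ODS_CUSTOMER src
       ON (tgt.CUST_ID = src.CUST_ID)
    WHEN MATCHED THEN
        UPDATE SET tgt.NAME1 = src.NAME1,
                   tgt.CITY  = src.CITY
    WHEN NOT MATCHED THEN
        INSERT (CUST_ID, NAME1, CITY)
        VALUES (src.CUST_ID, src.NAME1, src.CITY);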


Note: Additional available loading options depend on the database in which the table is defined.

Include in transaction
Indicates that this target is included in the transaction processed by a batch or real-time job. This option allows you to commit data to multiple tables as part of the same transaction. If loading fails for any one of the tables, no data is committed to any of the tables.

Transactional loading can require rows to be buffered to ensure the correct load order. If the data being buffered is larger than the virtual memory available, Data Integrator reports a memory error.

The tables must be from the same datastore.

Note: If you choose to enable transactional loading, these options are not available: Rows per commit, Use overflow file and overflow file specification, Number of loaders, Enable partitioning, and Delete data from table before loading.

Data Integrator also does not parameterize SQL or push operations to the database if transactional loading is enabled.

This is considered to be a more advanced option. See the Data Integrator Performance Optimization Guide and “Description of objects” in the Data Integrator Reference Guide for more information.

Transaction order
Transaction order indicates where this table falls in the loading order of the tables being loaded. By default, there is no ordering.

All loaders have a transaction order of zero. If you specify orders among the tables, the loading operations are applied according to the order. Tables with the same transaction order are loaded together. Tables with a transaction order of zero are loaded at the discretion of the data flow process.

This is considered to be a more advanced option. See the Data Integrator Performance Optimization Guide and “Description of objects” in the Data Integrator Reference Guide for more information.
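Conceptually, loading two targets from the same datastore with Include in transaction selected behaves like the single database transaction sketched below, with Transaction order deciding which load runs first. The SALES_HEADER, SALES_DETAIL, and staging tables are hypothetical; Data Integrator issues the actual statements itself.

    BEGIN TRANSACTION;

    -- target with Transaction order = 1 is loaded first
    INSERT INTO SALES_HEADER (ORDER_ID, ORDER_DATE)
    SELECT ORDER_ID, ORDER_DATE FROM STG_SALES_HEADER;

    -- target with Transaction order = 2 is loaded next
    INSERT INTO SALES_DETAIL (ORDER_ID, LINE_NO, AMOUNT)
    SELECT ORDER_ID, LINE_NO, AMOUNT FROM STG_SALES_DETAIL;

    -- either both loads are committed or, if one fails, neither is
    COMMIT;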


Executing the job

After completing this unit, you will be able to:
• Understand job execution
• Execute the job

After you create your project, jobs, and associated data flows, you can then execute the job. You can run Data Integrator jobs in different ways. Depending on your needs, you can configure:
• Immediate jobs
Data Integrator initiates both batch and real-time jobs and runs them immediately from within the Designer. For these jobs, both the Designer and designated Job Server (where the job executes, usually many times on the same machine) must be running. You will most likely run immediate jobs only during the development cycle.
• Scheduled jobs
Batch jobs are scheduled. To schedule a job, use the Data Integrator Web Administrator or use a third-party scheduler. When jobs are scheduled by third-party software:
• The job initiates outside of Data Integrator
• The job operates on a batch file (or shell script for UNIX) that has been exported from Data Integrator
When a job is invoked by a third-party scheduler:
• The corresponding Job Server must be running
• The Data Integrator Designer does not need to be running

Note: A job with syntax errors does not execute.

Introduction

Understanding job execution


You must associate a job to a project before you can execute the job.

To associate the job to a project, either create the job and corresponding hierarchy objects in the Project area, or drag the project with the associated objects from the Local object library into a project folder in the Project area.

If a project folder is in the Local object library Project tab but not in the Project area, double-click the project folder in the project tab to move the folder into the Designer tab of the Project area.

Immediate or on demand tasks are initiated from the Data Integrator Designer. Both the Designer and Job Server must be running for the job to execute.

To execute a job as an immediate task
1 In the project area, select the job name.
2 Right-click and select Execute.
Data Integrator prompts you to save any objects that have unsaved changes.

Options in the Execution Properties window such as Enable auditing and Enable recovery are discussed later in the course.

3 Click OK.
As Data Integrator begins execution, the execution window opens with the trace log button active.

Executing the job


Activity: Defining a data flow to load data into a dimension

Objectives
In this activity you will use the ODS_CUSTOMER source table to define the data flow to load data into the predefined CUST_DIM table in the target data warehouse.

To complete this activity successfully, you will:
• Create a project and add a job, work flow, and data flow to the project
• Define the data flow
• Map the Query transform
• Execute the job

Instructions

1 In the Designer Local object library, click the Project tab.
2 Right-click the project tab area, and select New.
3 Create a new project called Exercises.
The Exercises folder now displays both in the Local object library project tab and in the Designer tab of the project area.
4 In the Designer project area, right-click the Exercises project.
5 Select New Batch Job, and name this job CustDim_Job.
6 Remaining in the Designer Project area, click CustDim_Job to activate the job workspace area in the right side of the Designer.
7 From the Tool Palette on the right side of the Designer window, click the work flow icon, and then click the workspace to add a work flow to the job workspace area.

Practice


8 Name this work flow CustDim_WF:

Note: Navigate to the Work Flow tab in the Local object library area and you will see that CustDim_WF has also been added to the Local object library. You can later access this work flow from the Local object library to reuse the work flow in other jobs.

9 In the workspace, double-click CustDim_WF to open it.

10 From the Tool Palette, click the data flow icon, and then click the workspace again to add a data flow to the work flow workspace.

11 Name the data flow CustDim_DF.
Note: You can also create the project, job, work flow, and data flow objects from the Local object library by accessing the respective tabs. In the corresponding tab, right-click the area (e.g., the Job tab area), and select New. After the new object is created (e.g., NewJob), right-click the object, and select Rename to rename the object.

Defining the data flow
1 In the Designer tab of the Project area, click CustDim_DF to go to the data flow level.
2 From the Local object library, click the Datastores tab.
3 Click the (+) node to expand ODS_DS, and expand Tables.
4 Click the ODS_CUSTOMER table, and holding the cursor down, drag it to the data flow workspace.
5 From the drop-down list that appears, select Make Source.
6 From the Local object library, click the Transforms tab.
7 Click and drag the Query transform onto the workspace. Place the Query transform beside the ODS_CUSTOMER table.
8 From the Local object library, click the Datastores tab again.
9 In Target_DS, expand Tables.
10 Click and drag the CUST_DIM table onto the workspace.
11 Place CUST_DIM beside the Query transform, and select Make Target.
12 Double-click CUST_DIM to open it.
13 Click the Options tab, and select Delete data from table before loading.
Note: If you want to run your job more than once, you can set the target table options to either delete rows before updating (if you do not want to keep old rows) or use auto-correct load if you want to keep old rows and update the target when you re-execute the job.


Mapping the Query transform
1 At the data flow level, click and drag from the square on ODS_CUSTOMER to the triangle on the Query transform to connect these two objects together.
2 Connect the Query transform to CUST_DIM.
3 Double-click the Query transform to open it.
4 From the Schema In pane of the Query transform editor, click CUST_ID, and holding down the cursor, drag CUST_ID to the CUST_ID column in the Schema Out pane, and select Remap Column.
5 Repeat the same procedure and remap these columns:
• CUSTCLASS_F → CUSTCLASS_F
• NAME1 → NAME1
• ADDRESS → ADDRESS
• CITY → CITY
• REGION_ID → REGION_ID
• ZIP → ZIP
6 From the Designer tab of the Project area, click CustDim_Job to go back to the job level.
7 Right-click CustDim_Job, and select Execute.
8 In the Execution Properties window, click OK to execute the job.
A log appears on the right side of the Designer area to notify you that the job executed successfully.


Activity: Using a format file to populate a target table

Objectives
In this activity you will create a job and define a data flow by using the:
• Format file Format_SalesOrg that you created earlier as a source
• SALESORG_DIM dimension table as your target table
• Query transform to map the input columns from Format_SalesOrg to the output columns from the SALESORG_DIM target table

Instructions
1 In the Exercises project, create a new batch job called SalesOrg_Job and add a work flow named SalesOrg_WF.
2 Add a data flow to SalesOrg_WF. Name this data flow SalesOrg_DF.
3 Click SalesOrg_DF.
You should see a new tab with the data flow name at the bottom of the Workspace.
4 Add the file format named Format_SalesOrg to the workspace and make it a source.
5 Add the Query transform to the data flow, and connect the source file to the query.
6 Select the table SALESORG_DIM from the Target_DS datastore, add it to the workspace, and make it a target.
Tip: If you want to run your job more than once, set the target table options to either delete rows before updating (if you do not want to keep old rows) or use auto-correct load if you want to keep old rows and update the target when you re-execute the job.
7 Connect the Query transform to the SALESORG_DIM table.
8 Double-click the Query transform.
9 Remap these columns:
• SalesOffice → SalesOffice
• Region → Region
• DateOpen → DateOpen
10 Execute the job.

Practice


Adding a new table to the target using template tables

After completing this unit, you will be able to:

• Use template tables

During the initial design of an application, you might find it convenient to use template tables to represent database tables. Template tables are particularly useful in early application development when you are designing and testing a project.

With template tables, you do not have to initially create a new table in your DBMS and import the metadata into Data Integrator. Instead, Data Integrator automatically creates the table in the database with the schema defined by the data flow when you execute a job.
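For example, for a template table whose data flow output schema contains just the FirstName and LastName columns (as in the Template activity later in this unit), the generated DDL would look roughly like the sketch below; the exact data types and lengths depend on the target database and the mapped columns.

    -- hedged sketch of the CREATE TABLE statement generated for a template table
    CREATE TABLE Template (
        FirstName VARCHAR(255),
        LastName  VARCHAR(255)
    );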

After creating a template table as a target in one data flow, you can use it as a source in other data flows. Although a template table can be used as a source table in multiple data flows, it can be used only as a target in one data flow.

You can modify the schema of the template table in the data flow where the table is used as a target. Any changes are automatically applied to any other instances of the template table. During the validation process, you receive warnings of any errors, such as errors that result from changing the schema.

After you are satisfied with the design of your data flow, save it. When the job is executed, Data Integrator uses the template table to create a new table in the database you specified when you created the template table.

After a template table is created in the database, you can convert the template table in the repository to a regular table. You must convert template tables so that you can use the new table in expressions, functions, and transform options.
Note: After a template table is converted, you can no longer alter the schema.
Tip: Ensure that any files that reference flat file formats are accessible from the Job Server where the job will be run, and specify the file location relative to this computer.

Introduction

Using template tables


To create a template table
1 Open a data flow in the Designer workspace.
2 In the Designer tool palette, click the template table icon and drag it into your workspace to add a new template table to the data flow.
3 In the Create Template window, enter the name for the template table.
4 Click OK.
Tip: You also can create a new template table in the object library datastore tab by expanding a datastore and right-clicking Templates.

To convert a template table into a regular table from the object library
1 Open the object library, and select the Datastores tab.
2 Click the plus sign (+) next to the datastore that contains the template table you want to convert.
A list of objects appears.
3 Click the plus sign (+) next to Template Tables.
The list of template tables appears.

4 Right-click a template table you want to convert and select Import Table.

Data Integrator converts the template table in the repository into a regular table by importing it from the database.

5 To update the icon in all data flows, select View > Refresh. In the datastore object library, the table is now listed under Tables rather than Template Tables.


To convert a template table into a regular table from a data flow
1 Open the data flow containing the template table.
2 Right-click the template table you want to convert and select Import Table.

After a template table is converted into a regular table, you can no longer change the table’s schema.


Activity: Using template tables

Objective
In this activity you will:
• Create a new file format to define the source data from a text file
• Create a job from the Local object library and load the newly defined data into a template table
• Import the template table to convert it into a regular table

Instructions
1 In the Local object library, click the Jobs tab, and create a new job called Template_Job.
You do not need a work flow in this exercise.
2 Create a new data flow called Template_DF.
3 In the Formats tab of the object library, create a flat file called Name_Position.
Note: Spaces are not supported in flat file names.
4 In the File Format Editor, under the Data File Root directory field, navigate to the source data located in C:\<DI install folder>\Tutorial Files.
5 In the Data File area, under File name(s), select NamePos.txt.
The Column Attributes and Data Preview area of the File Format Editor should populate with the NamePos.txt source data values.
6 Under the Property Input/Output area, change the value for the Skip row header field to Yes.
7 In the Column Attributes area, change the values in the Field Name column:
• Field 1 to FirstName
• Field 2 to LastName
• Field 3 to Position
8 Save and close the changes you made in the Name_Position flat file.
9 Double-click Template_DF to open it in the workspace.
10 Drag the Name_Position flat file onto the workspace and make it a source.
11 Add the Query transform and connect it to the source flat file.

Practice


12 Use the Designer Tool palette to add a target template table to your data flow. You can call this template table Template.

13 Connect the Query transform to the template table.
14 Double-click the Query transform.
15 Drag the FirstName and LastName columns from the Schema In area and remap them to the Schema Out area in the Query Editor.
16 Save the job.
17 Drag the Template_Job from the Local object library to the Exercises project in the project area, and right-click Template_Job to execute the job.

18 You should get a message that the job ran correctly.

19 Click the magnifying glass on the Template table to view the resulting data. The column headings should display as FirstName and LastName.

20 From the Designer Local object library, in the Datastores tab, expand Target_DS, and expand Template Tables.

21 Right-click Template, and select Import Table. Expand Tables under Target_DS and you should see that Template is now an actual table in the datastore.


Lesson Summary

Quiz: Creating a Batch Job
1 Does a data flow have to be part of a job?

2 How do you add a new template table?

3 Name some of the objects contained within a project.

4 What is a conditional? Would it appear in a data flow, or in a work flow?

5 Explain the concept of a try/catch block. How do they fit into a Data Integrator work flow?

Review


After completing this lesson, you are now able to:
• Create a project
• Create a job
• Add, connect, and delete objects in the workspace
• Create a work flow
• Create a data flow
• Explain source and target objects
• Add source and target objects to a data flow
• Explain what a transform is
• Understand the Query transform
• Describe the Query editor window
• Use the Query transform in a data flow
• Understand job execution
• Execute the job
• Use template tables

Summary


Lesson 5
Validating, Tracing, and Debugging Batch Jobs

Before you execute a job, you can view the SQL generated by your data flow and adjust your design to maximize the SQL that is pushed down to improve performance. You can also validate jobs, set trace options, and use logs to debug jobs.

It is also helpful to use descriptive annotations for your jobs, work flows, and data flows.

In this lesson you will learn about:
• Understanding pushed-down operations
• Using descriptions and annotations
• Validating and tracing jobs
• Using View Data and the Interactive Debugger

Duration: 1.5 hours


Understanding pushed-down operations

Pushing down operations to the source or target database reduces the number of rows and operations that the engine must retrieve and process, thus improving performance.

After completing this unit, you will be able to:
• List operations that Data Integrator pushes down to the database
• View SQL generated by a data flow

Data Integrator examines the database and its environment when determining which operations to push down to the database.

Operations that Data Integrator pushes to the database include:
• Aggregations
Aggregate functions, typically used with a Group by statement, always produce a data set smaller than or the same size as the original data set.
• Distinct rows
Data Integrator will only output unique rows when you use distinct rows.
• Filtering
Filtering can produce a data set smaller than or equal to the original data set.
• Joins
Joins typically produce a data set smaller than or similar in size to the original tables.
• Ordering
Ordering does not affect data-set size. Data Integrator can efficiently sort data sets that fit in memory. Since Data Integrator does not perform paging (writing out intermediate results to disk), it is recommended that you use a dedicated disk-sorting program such as SyncSort or the DBMS itself to order very large data sets.
• Projections
A projection normally produces a smaller data set because it only returns columns referenced by a data flow.
• Functions
Most Data Integrator functions that have equivalents in the underlying database are appropriately translated.
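For example, a data flow whose Query filters, joins, aggregates, and orders data from two tables in the same datastore can often be collapsed into a single pushed-down statement such as the hedged sketch below (ORDERS and CUSTOMERS are hypothetical tables; the actual statement is whatever Display Optimized SQL shows for your data flow):

    -- projection, filter, join, aggregation, and ordering all pushed to the database
    SELECT   c.REGION_ID,
             SUM(o.AMOUNT) AS TOTAL_AMOUNT
    FROM     ORDERS o
             INNER JOIN CUSTOMERS c ON c.CUST_ID = o.CUST_ID
    WHERE    o.ORDER_DATE >= '2005-01-01'
    GROUP BY c.REGION_ID
    ORDER BY c.REGION_ID;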

Transform operations that are not pushed down include:
• Expressions that include Data Integrator functions that do not have database correspondents
• Load operations that contain triggers

Introduction

Pushed-down operations


Similarly, not all operations can be combined into single requests.

For example, when a stored procedure contains a COMMIT statement or does not return a value, you cannot combine the stored procedure SQL with the SQL for other operations in a query. You can only push operations supported by the DBMS down to that DBMS.
Note: You cannot push built-in functions or transforms to the source database. For best performance, do not intersperse built-in transforms among operations that can be pushed down to the database. Database-specific functions can only be used in situations where they will be pushed down to the database for execution.

Before running a job, you can view the SQL generated by the data flow and adjust your design to maximize the SQL that is pushed down to improve performance. Alter your design to improve the data flow when necessary.

To view SQL

1 In the Local object library, click the Data Flows tab, and select the data flow.

2 Right-click the data flow, and select Display Optimized SQL.
The Optimize SQL screen appears showing a list of datastores and the optimized SQL code for the selected datastore.

Note: Data Integrator only shows the SQL generated for table sources. Data Integrator does not show the SQL generated for SQL sources that are not table sources, such as the lookup function, key_generation transform, key_generation function, table_comparison transform, and target tables.

Viewing SQL generated by a data flow


Using descriptions and annotations

Descriptions are a convenient way to add comments to workspace objects. You can use descriptions to document objects on the workspace diagram.

After completing this unit, you will be able to:
• Use descriptions with objects
• Use annotations to describe job, work, and data flows

A description is associated with a particular object. When you import or export a repository object (for example, when migrating between development, test, and production environments), you also import or export its description.

The Designer determines when to show object descriptions based on a system-level setting and an object level setting. Both settings must be activated to view the description for a particular object.Note: The system-level setting is unique to your setup.

To add a description to an object
1 In the project area, right-click an object and select Properties.
2 Enter your comments in the Description text box.
3 Click OK.

The description for the object displays in the object library.
Note: The Designer might display a warning that you are modifying the description of a reusable object. This means that other Data Integrator applications or jobs using this object will also have the added description. Click Yes if you want this description to appear when the object is used outside of your current application or job.

To display a description in the workspace
1 In the project area, select an existing object, such as a job, that contains an object to which you have added a description, such as a work flow.
2 Display the object in the workspace, right-click the object in the workspace, and select View Enabled Object Descriptions.
The description displays in the workspace under the object.
3 Double-click the annotation icon in the workspace to add, edit, and delete text directly onto the annotation.

Introduction

Using descriptions with objects


Annotations describe a flow, part of a flow, or a diagram in a workspace. An annotation is associated with the job, work flow or data flow where it appears. When you import or export a job, work flow, or data flow that includes annotations, you also import or export associated annotations. You can use annotations to describe any workspace such as a job, work flow, data flow, catch, conditional, or while loop.

As a best practice, write an annotation whenever you create an object.

To add an annotation to a job, work flow, data flow, or diagram in the workspace
1 Open the workspace diagram you want to annotate.
2 In the tool palette, click the annotation icon.
3 Click the location in the workspace to place the annotation.
An annotation appears on the diagram.
Add text on the annotation.
Note: In addition, you can resize and move the annotation by clicking and dragging. You can add any number of annotations to a diagram. You cannot hide annotations that you have added to the workspace. However, you can move them out of the way or delete them.

Using annotations to describe job, work, and data flows


Validating and tracing jobs

It is a good idea to validate your jobs when you are ready for job execution. Before job execution you can also select and set specific trace properties for your job. After the job executes, you can use the various log files to help you read job execution status or troubleshoot job error logs.

After completing this unit, you will be able to:
• Validate jobs
• Trace jobs
• Use log files

As a best practice, you want to validate your work as you build objects so that you are not confronted with too many warnings and errors at one time. You can validate your objects as you create a job, or you can automatically validate all your jobs before executing them.

To validate
• From the Validation menu, select Validate > All Objects in View.

To validate jobs automatically before job execution
1 From the Designer toolbar, click Tools, and select Options.
2 In the Options window, expand Designer, and click General.
3 From the list of options displayed, select Perform complete validation before job execution.

Handling validation errors
1 Right-click the message, and select View.
View displays the complete validation message with text wrapping.
2 Select Go To Error to open the object where the error occurred.

Introduction

Validating jobs


Setting job execution options
Options for jobs include Debug and Trace. Although these are object options—they affect the function of the object—they are located in either the Property or the Execution window associated with the job.

Execution options for jobs can either be set for a single instance or as a default value.
• The right-click Execute menu sets the options for a single execution only and overrides the default settings
• The right-click Properties menu sets the default settings

To set execution options for every execution of the job
1 From the Project area, right-click the job name and select Properties.
2 Select options from the Properties window.

Use trace properties to select the information that Data Integrator monitors and writes to the monitor log file during a job. Data Integrator writes trace messages to the monitor log associated with the current Job Server and writes error messages to the error log associated with the current Job Server.

To set trace options
1 From the Project area, right-click the job name, and select Execute.
2 Click the Trace tab.
3 Under the Name column, click a trace object name.
The Value drop-down list is enabled when you click a trace object name.
4 From the Value drop-down list, select Yes to turn the trace on.

Tracing jobs


5 Click OK.
Tip: You can enable your job to print all trace messages by selecting Print all trace messages in the job Execution Properties window.

Below is a list of some of the trace options available. For a complete list, see “Batch Job Descriptions” under Description of Objects in the Data Integrator Reference Guide.

Trace Description

Row
Writes a message when a transform imports or exports a row.

Session
Writes a message when the job description is read from the repository, when the job is optimized, and when the job runs.

Work Flow
Writes a message when the work flow description is read from the repository, when the work flow is optimized, when the work flow runs, and when the work flow ends.

Data Flow
Writes a message when the data flow starts, when the data flow successfully finishes, or when the data flow terminates due to error.

SQL Readers
Writes the SQL query block that a script, query transform, or SQL function submits to the system. Also writes the SQL results.

SQL Loaders
Writes a message when the bulk loader starts, submits a warning message, or completes successfully or unsuccessfully.
It also reports all SQL that Data Integrator submits to the target database, including:
• When a truncate statement executes if the Delete data from table before loading option is selected
• Any parameters included in PRE-LOAD SQL commands
• Before a batch of SQL statements is submitted
• When a template table is created (and also dropped, if the Drop/Create option is turned on)
• When a delete statement executes if autocorrect is turned on (Informix environment only)


As a job executes, Data Integrator produces three log files. You can view these from the Designer project area. The log files are, by default, also set to display automatically in the workspace when you execute a job.

To enable the monitor to display log files in the workspace
• From the main toolbar, select Tools > Options > Designer > General > Open monitor on job execution.

This table lists the log files created during job execution:

When you execute the job, the logs are displayed in the workspace.

You can click on the monitor, statistics, error, and warning log icons to view these logs.
Note: You can determine how long you want to keep logs using the Data Integrator Administrator. The Data Integrator Administrator is covered in Lesson 13 of this course.

Using log files

Tool Description

Monitor log
Displays each step of each data flow in the job, the number of rows streamed through each step, and the duration of each step.

Statistics log
Itemizes the steps executed in the job and the time execution begins and ends.

Error log
Displays the name of the object being executed when a Data Integrator error occurred. Also displays the text of the resulting error message.


Examining monitor logs
Use the monitor logs to determine where an execution failed, whether the execution steps occur in the order you expect, and which parts of the execution are the most time consuming.

Examining statistics logs
The statistics log quantifies the activities of the components of the job. It lists the time spent in a given component of a job and the number of data rows that streamed through the component.

Examining error logs
Data Integrator produces an error log for every job execution. Use the error logs to determine how an execution failed.

Note: If the execution completed without error, the error log is blank.


To copy log content from an open log
• Select one or multiple lines and use the key command [Ctrl+C].

To delete a log
1 In the monitor area, right-click the log you want to delete, and select Delete log.

Displaying job logs in the Designer project area
After a job is executed, these logs appear in the Designer project area Monitor and Log tabs.

To view log files from the Designer project area
1 From the Designer project area, click the Log tab.
2 In the filter drop-down list, select the log you want to view.

Use the trace, statistics, and error log icons to view each type of available log for the date and time that the job was run.

Monitor tab
The Monitor tab lists the trace logs of all current or most recent executions of a job.

The traffic-light icons in the Monitor tab are:
• A green light indicates that the job is running.
You can right-click, and select Kill Job to stop a job that is still running.
• A red light indicates that the job has stopped.
You can right-click, and select Properties to add a description for a specific trace log. This description is saved with the log, which can be accessed later from the Log tab.
• A red cross indicates that the job encountered an error.


Log tab
You can also select the Log tab to view a job’s trace log history.

Note: The icon indicates that the job encountered an error on this explicitly selected Job Server.

If you are using multiple Job Servers, you may also find these additional job log indicators:

Determining success for job execution
The best measure of the success of a job is the state of the target data. Always examine your data to make sure the data movement operation produced the results you expect. Be sure that:
• Data was not converted to incompatible types or truncated.
• Data was not duplicated in the target.
• Data was not lost between updates of the target.
• Generated keys have been properly incremented.
• Updated values were handled properly.
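Some of these checks can be scripted directly against the target. For instance, the duplicate check on the course CUST_DIM table might look like the query below, assuming CUST_ID is meant to be unique; adapt the table and key column to your own target.

    -- any rows returned here indicate duplicated keys in the target
    SELECT   CUST_ID,
             COUNT(*) AS ROW_COUNT
    FROM     CUST_DIM
    GROUP BY CUST_ID
    HAVING   COUNT(*) > 1;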

If a job fails to execute:
• Check the job server icon in the status bar.
• Verify that Job Service is running.
• Check that the port number in Designer matches the number specified in Server Manager.
• Use Server Manager’s resync button to reset the port number in the local repository.

Job log indicator
Description

Indicates that the job executed successfully on this explicitly selected Job Server.

Indicates that the job was executed successfully by a server group. The Job Server listed executed the job.

Indicates that the job encountered an error while being executed by a server group. The Job Server listed executed the job.


Activity: Setting traces and adding annotations in a job

Objectives
In this activity you will:
• Execute the SalesOrg_Job to trace the start and end time of your work flow and data flow
• Add a descriptive annotation to a job

Instructions
1 In the project area, right-click to execute the SalesOrg_Job.
2 In the Execution Properties window, under the Trace tab, set these trace values to NO: Trace Session, Trace Work Flow, and Trace Data Flow.
3 Change these trace values to YES: Trace SQL Readers and Trace SQL Loaders.
4 Click OK to execute the job.
The only output will be the printout of the two traces you selected.
5 Click the statistics icon to display information on the State, Row Count, and time it took the job to complete.
Tip: You can also view the trace logs by clicking the Log tab in the Designer project area and filtering the drop-down list to display Trace Logs.
6 From the tool palette, add an annotation to the top left-hand side of your data flow, and enter: This is an example of a job using different traces.

7 Resize the annotation box accordingly.

Practice


Using View Data and the Interactive Debugger

You can debug jobs in Data Integrator using the View Data and Interactive Debugger features. With View Data you can view imported source data, data in transforms, and the resulting data in your targets. Using the Interactive Debugger you can examine what happens to the data after each transform or object in the flow.

After completing this unit, you will be able to:
• Use View Data with sources and targets
• Use the Interactive Debugger
• Set filters and breakpoints for a debug session

With View Data, you can check the status of data at any point after you import a data source, and before or after you process your data flows. You can check the data when you design and test jobs to ensure that your design returns the results you expect.

If you want to view data for a source or target file, the source or target file must physically exist and be available from your computer’s operating system. To view data for a table, the table must be from a supported database.

Sources
View Data allows you to see source data before you execute a job. Using data details you can:
• Create higher quality job designs
• Scan and analyze imported table and file data from the object library
• See the data for those same objects within existing jobs
• Refer back to the source data after you execute the job.

Targets
View Data allows you to check your target data before executing your job, then look at the changed data after the job executes. In a data flow, you can use one or more View Data panels to compare data between transforms and within source and target objects.

Introduction

Using View Data with sources and targets


To use View Data in source and target tables
• In the Datastore tab of the object library, right-click a source or target table and select View Data.
The View Data window displays.

View Data displays your data in the rows and columns of a data grid. The number of rows displayed is determined by a combination of several conditions:
• Sample size: the number of rows sampled in memory. The default sample size is 1000 rows for imported sources, targets, and transforms.
• Filtering
• Sorting: if your original data set is smaller or if you use filters, the number of returned rows could be less than the default.


To open a View Data pane in the Designer workspace
1 Click the magnifying glass button on a data flow object.

A large View Data pane appears beneath the current workspace area.

2 Click the magnifying glass button for another object and a second pane appears below the workspace area. The first pane area shrinks to accommodate the presence of the second pane.

Tip: When both panes are filled and you click another View Data button, a small menu appears containing window placement icons. The black area in each icon indicates the pane you want to replace with a new set of data. Click a menu option and the data from the latest selected object replaces the data in the corresponding pane.

The path for the selected object displays at the top of the pane.

For sources and targets, this consists of the full object name:
• ObjectName(Datastore.Owner) for tables
• FileName(File Format Name) for files

Tips for using View Data
• Use one or more View Data windows to view and compare sample data from different steps.
• Use View Data while you are building a job to ensure that your design returns the results you expect.


The Designer includes an interactive debugger that allows you to examine and modify data row by row by placing filters and breakpoints on lines in a data flow diagram during a debug mode job execution. Using the debugger you can examine what happens to the data after each transform or object in the flow.

The debug mode provides the interactive debugger’s windows, menus, and toolbar buttons that you can use to control the pace of the job and view data by pausing the job execution using breakpoints or the Pause Debug option.

When you run the interactive debugger, the Designer displays three additional windows: Call Stack, Trace, and Variables, as well as View Data panes.

The left View Data pane shows the data in the source table, and the right pane shows one row at a time (the default when you set a breakpoint) that has passed to the query.

The option that is unique to the Debug Properties window is Data sample rate. The Data sample rate is the number of rows cached for each line when a job executes using the interactive debugger.

For example, if the source table has 1000 rows and you set the Data sample rate to 500, then the Designer displays up to 500 of the last rows that pass through a selected line. The debugger displays the last row processed when it reaches a breakpoint.
Note: View Data is available both inside and outside of the interactive debugger. However, viewing changed data in transforms can only be accessed by running the interactive debugger.

Using the Interactive Debugger


To start the interactive debugger
1 In the project area, select a job to debug.
2 From the toolbar click Debug, and select Start Debug.
3 Set properties for the execution.
4 Click OK.
The debug mode begins. While in debug mode, all other Designer features are set to read-only.

To use View Data in the interactive debugger
1 In the project area, select a job and run the debug mode.
2 You should see a message asking you if you want to exit the interactive debugger.
3 Click No.
4 Go back to the data flow level and you see magnifying glasses appear on both source and target connecting lines.
5 Click the magnifying glass on each connecting line.
The original source data is displayed against the changed data in the transform.

To exit the debug mode
1 From the toolbar, click Debug, and select Stop Debug.

To view job sample data using View Data
1 In the workspace, drill into the last job you executed.
2 To view captured data, click the View Data button displayed over a transform in a data flow.
3 Navigate through the data to analyze it.


You can set filters and breakpoints on lines in a data flow diagram before you start a debugging session that allows you to examine and modify data row-by-row during a debug mode job execution.

A debug filter functions as a simple Query transform with a WHERE clause; however, complex expressions are not supported in a debug filter. Use a filter if you want to reduce a data set in a debug job execution.

When you set a filter, you place a debug filter on a line between a source and a transform or two transforms. If you set a filter and a breakpoint on the same line, Data Integrator applies the filter first. The breakpoint can only see the filtered rows.
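For example, the debug filter used in the activity at the end of this unit reduces the data set with the simple condition shown below; more complex expressions are not accepted as debug filters.

    -- a simple, supported filter condition (the one set in the activity at the end of this unit)
    Format_SalesOrg.Region = 'NW'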

Like a filter, you can set a breakpoint between a source and transform or two transforms. A breakpoint is the location where a debug job execution pauses and returns control to you.

A breakpoint condition applies to the after image for UPDATE, NORMAL and INSERT row types and to the before image for a DELETE row type. Instead of selecting a conditional or unconditional breakpoint, you can also use the Break after n row(s) option. In this case, the execution pauses when the number of rows you specify passes through the breakpoint.

You can use a breakpoint with or without conditions. If you use a breakpoint without a condition, the job execution pauses for the first row passed to a breakpoint. If you use a breakpoint with a condition, the job execution pauses for the first row passed to the breakpoint that meets the condition.
Note: If you do not set pre-defined filters or breakpoints, the Designer will optimize the debug job execution. This means that the first transform in each data flow of a job is pushed down to the source database. As a result, you cannot view the data in a job between its source and the first transform unless you set a pre-defined breakpoint on that line. For more information on the interactive debugger and setting filters and breakpoints, see “Design and Debug”, Chapter 15 in the Data Integrator Designer Guide.

Setting filters and breakpoints for a debug session


To set filters and breakpoints
1 Open the job you want to debug with filters and breakpoints in the workspace.
2 Open one of its data flows.
3 Right-click the connecting line that you want to examine, and select Set Filter/Breakpoint.

4 In the breakpoint window, under the Filter or Breakpoint columns, select the Set check box.

5 Complete the Column, Operator, and Value columns accordingly.


Activity: Using the interactive debugger with filters and breakpoints

Objectives
In this activity you will practice using the Interactive Debugger by setting a breakpoint to stop the data flow when the interactive debugger reaches a row with a Region value of NW, and view the data in debug mode.

This breakpoint allows you to pause the job execution when a certain condition is met. In this example, the debugger stops at the first row it encounters with a Region value of NW. This allows you to pause, with control returned to you, and examine each row returned that meets the condition you specify.

Instructions
1 Open the SalesOrg_DF in the workspace.
2 Right-click the line between the Format_SalesOrg source and the Query transform.
3 Select Set Filter/Breakpoint.

4 In the Breakpoint pane, select the Set check box.
5 In the Breakpoint pane, click the field under Column.
6 From the drop-down list, select Format_SalesOrg.Region.
7 Click the field under Operator and select =.
8 Click the field under Value, and type NW, and click OK.

Practice


A breakpoint is placed on the line connecting the source table and Query transform.

9 In the Designer project area, right-click SalesOrg_Job, select Start debug, and click OK. The debugger stops after processing the first row with a Region value of NW, as displayed in the View pane.

10 From the Designer toolbar, click the Get Next Row button to get the next row.
You can see that the next row replaces the existing row in the View pane.

11 Get the next row again.
12 In the View pane, click the All check box to see all rows.

13 From the Designer toolbar, click the Stop Debug button to stop debug mode.


Lesson Summary

Quiz: Validating, Tracing, and Debugging Batch Jobs
1 What must be running in order to execute a job immediately?

2 List some reasons why a job might fail to execute.

3 List some ways for ensuring that the job executed correctly.

4 Explain the View Data option. Is this data permanently stored by Data Integrator?

5 List and explain the conditions that determine how many rows of data will be visible in the View data pane.

Review


After completing this lesson, you are now able to:
• List operations that Data Integrator pushes down to the database
• View SQL generated by a data flow
• Use descriptions with objects
• Use annotations to describe job, work, and data flows
• Validate jobs
• Trace jobs
• Use log files
• Use View Data with sources and targets
• Use the Interactive Debugger
• Set filters and breakpoints for a debug session

Summary


Workshop

Reading multiple file formats

Duration: 45 minutes to 1 hour

Objective
In this activity you will use the concepts you have learned in Lessons 1 to 5 to:
• Create a single file format that reads more than one identically formatted file
• Use the Query transform to create a data flow that loads data from flat file sources to a target template table
• Log flat file errors that occur during the data movement

Scenario
As an ETL engineer, it is very common for you and other database administrators to exchange data using flat files.

You are given three different flat files that you need to categorize and organize to see if there is some similarity between their formats. If there is similarity between these formats, then you may consider moving all three files simultaneously from source to target. You also want to catch any conversion errors and use this new data in other Data Integrator ETL jobs.

Instructions

1 Create a temp folder in your computer's C: drive.
2 From the resource CD, copy these flat files to the C:\temp folder you just created:
   • orders011197.txt
   • orders071196.txt
   • orders071197.txt
3 Use these naming conventions to create your job and data flow:
   • Orders_FlatFile_Job
   • Orders_FlatFile_DF
4 Open each flat file in Notepad and try to determine the commonalities between them. Things you want to look for are:
   • Flat file type
   • Column type
   • Text delimiter
   • Date formats
   • Bad rows
   • Existing row headers

Practice


5 Your findings should match this list:
   • The files are delimited
   • The column delimiter is a semicolon (;)
   • The text delimiter is a double quote (")
   • The date format is dd-mon-yyyy
   • Some rows in the files are bad and are indicated by (//) at the beginning of a row. Ignore these rows; do not read them.
   • The file has a header row which contains the column names. These rows should be skipped.
   (A conceptual parsing sketch of these format rules follows these instructions.)
6 Now that you have analyzed your flat files, create a single file format using your findings and the list criteria above. Name this flat file Orders_Flat_file.
   Tip: Use one of the .txt files provided to define your initial file format.

7 Since the column data in the file (date, numbers) can be bad, enable error handling and capture data conversion errors. Write these errors to c:\temp\file_errors.txt

8 Also note that the file has 9 columns so your new schema should look like this:

Note: Since order sequencing for defining the values within the file format is important, make sure that you define all the values on the left-pane of the File Format Editor before you define the columns in the Column Attributes area.

Tip: Before you save and close your file, remember that Data Integrator uses a wildcard to read a similar list of file names.

9 Define your data flow using Orders_Flat_file as your source and the Query transform.

10 Create a Target template table called ORDERSINFO.
11 There are no transformations needed, but you still have to map the columns in the Query transform.
12 Validate and execute the job.
   Note: You must drag the job from the Local object library into a project folder in the Designer project area before executing the job. Data conversion warning messages are OK. Your job should execute successfully, but notice that your job has generated two errors.
13 Import the ORDERSINFO template table into your Target database.
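For reference, the format rules above behave conceptually like the following sketch. This is plain Python, not Data Integrator code; the wildcard path, the position of the date column, and the error handling are assumptions made only for illustration.

```python
import csv
import glob
from datetime import datetime

def read_orders(pattern="c:/temp/orders*.txt"):
    """Conceptual sketch of the Orders_Flat_file format: semicolon-delimited,
    double-quoted text, dd-mon-yyyy dates, one header row, bad rows marked with //."""
    rows, errors = [], []
    for path in glob.glob(pattern):              # wildcard reads all similar files
        with open(path, newline="") as f:
            reader = csv.reader(f, delimiter=";", quotechar='"')
            next(reader, None)                   # skip the header row
            for record in reader:
                if record and record[0].startswith("//"):
                    continue                     # ignore rows flagged as bad
                try:
                    # assumed: the third column holds the dd-mon-yyyy date
                    record[2] = datetime.strptime(record[2], "%d-%b-%Y")
                    rows.append(record)
                except (ValueError, IndexError) as exc:
                    errors.append((path, record, str(exc)))  # like file_errors.txt
    return rows, errors
```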


Questions

1 How many rows are loaded into the table?
2 Browse to file_errors.txt and open it. How many errors are in the error file? Explain why these rows may have appeared.
3 Why do you have to import the template table into your database?

Solutions

1 The total rows in your target should be 217. Click the magnifying glass on ORDERSINFO and the row count should be 217.
2 You should see two lines representing bad data in this error file. This is because the file analysis (step 5) shows that there could be bad data, and you enabled error handling (step 7) to capture these rows.
3 Expand your Target datastore and check to see that ORDERSINFO is no longer a template table but is now a permanent table in your Target database. It is recommended that you import your template table into the database. When you import the template table, the imported table definition is available to the Designer. You must convert template tables so that you can use the new table in expressions, functions, and transform options.

A solution file, Multiple_FlatFiles_Solution.atl, is also included on your resource CD. Import this .atl file into the Designer to view the actual job, data flow, and file format definitions.

To import this .atl solution file, right-click in your object library, select Repository, and click Import from File. Browse to the resource CD to locate the file and open it.
Note: Do not execute the solution job, as this may overwrite the results in your target table. Use the .atl solution file as a reference to view your data flow design and mapping logic.


Lesson 6 Using Built-in Transforms and Nested Data

A transform enables you to control how data sets change in a data flow.

In this lesson you will learn about:
• Describing built-in transforms
• Understanding nested data
• Understanding operations on nested data

Duration: 2 hours


Describing built-in transforms

In addition to the Query transform, some of the most commonly used transforms are the Case, Merge, Validation, and Date_Generation transforms.
Note: The activity for the Date_Generation transform is in Lesson 7.

For the purpose of this course, you will learn in detail about how these most commonly used transforms work. You will also be able to practice using these transforms in activities. For more information on the remaining transforms, refer to the Data Integrator Reference Guide.

The XML_Pipeline transform is covered later in this lesson. You also learn about history preserving in Lesson 9 of this course.

After completing this unit, you will be able to:
• Describe and use the most commonly used built-in transforms in a data flow

A transform is a step in a data flow that acts on a data set. Transforms manipulate data input sets and produce one or more output data sets. You can choose to edit the input data, options, and output data in a transform.
Note: Not all transforms offer input data options.

Data Integrator also maintains operation codes that describe the status of each row in each data set. Operation codes are carried by the inputs to and outputs from objects in data flows. You can use operation codes with transforms to indicate how each row in the data set is applied to a target table.

Introduction

Built-in transforms


Below is a list of available operation codes:


Operation code Description

NORMAL Creates a new row in the target.

All rows in a data set are flagged as NORMAL when they are extracted by a source table or file. If a row is flagged as NORMAL when loaded into a target table or file, it is inserted as a new row in the target.

Most transforms operate only on rows flagged as NORMAL.

INSERT Creates a new row in the target.

Rows can be flagged as INSERT by the Table_Comparison transform to indicate that a change occurred in a data set as compared with an earlier image of the same data set.

The Map_Operation transform can also produce rows flagged as INSERT. Only History_Preserving and Key_Generation transforms can accept data sets with rows flagged as INSERT as input.

DELETE Ignored by the target. Rows flagged as DELETE are not loaded.

Rows can be flagged as DELETE only by the Map_Operation transform.

UPDATE Overwrites an existing row in the target table.

Rows can be flagged as UPDATE by the Table_Comparison transform to indicate that a change occurred in a data set as compared with an earlier image of the same data set.

The Map_Operation transform can also produce rows flagged as UPDATE. Only History_Preserving and Key_Generation transforms can accept data sets with rows flagged as UPDATE as input.
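As a rough, hypothetical illustration of how these flags are interpreted when rows reach a target, the following plain-Python sketch applies flagged rows to an in-memory table. The row layout and values are invented; this is not Data Integrator syntax.

```python
# Each row carries an operation code describing how the target should apply it.
rows = [
    {"op": "NORMAL", "key": 1, "name": "Ann"},   # extracted rows default to NORMAL
    {"op": "INSERT", "key": 2, "name": "Bob"},   # e.g. flagged by Table_Comparison
    {"op": "UPDATE", "key": 1, "name": "Anne"},  # overwrites the existing row
    {"op": "DELETE", "key": 2, "name": "Bob"},   # ignored by the target; not loaded
]

target = {}
for row in rows:
    if row["op"] in ("NORMAL", "INSERT"):
        target[row["key"]] = row["name"]     # create a new row in the target
    elif row["op"] == "UPDATE":
        target[row["key"]] = row["name"]     # overwrite an existing row
    elif row["op"] == "DELETE":
        pass                                  # rows flagged DELETE are not loaded

print(target)  # {1: 'Anne', 2: 'Bob'}
```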


Transforms are added as components to your data flow. Each transform provides different options that you can specify based on the transform's function.

Data Integrator also offers you embedded data cleansing technology from Firstlogic. This partnership allows you to have access to new transformations for parsing, standardizing, correcting, matching, and merging customer records.
Note: You can only use the Name Parsing, Match Merge and Address Enhancement transforms if you have purchased and installed Firstlogic.

For more information go to http://www.businessobjects.com/products/dataintegration/default.asp or download the Firstlogic data cleansing information sheet at: http://www.businessobjects.com/global/pdf/products/dataintegration/bodi_Data_Cleansing_Information_Sheet.pdf.

Each transform provides different data input, options, and data output that you can specify in the Transform Editor. The next section gives a description of the function, data input requirements, options, and data output results for each built-in transform.

Case transform

The Case transform provides case logic based on row values and operates within data flows. You use the Case transform to simplify branch logic in data flows by consolidating case or decision-making logic into one transform.

You can use the Case transform to read a table that contains sales revenue facts for different regions, and separate the regions into their own tables.

For example, an East Region Table, West Region Table and so on.

This helps with more efficient data access in reporting as shown below:

Data Input

Only one data flow source is allowed as a data input for the Case transform.
Note: Depending on the data, only one of multiple branches is executed per row. The input and output schemas are also identical when using the Case transform.

Options

The Case transform offers several options:
• Label: the name of the connection description indicating where data will go if the corresponding Case condition is true.
• Expression: the Case expression for the corresponding label.
• Default: only available if the Produce default output when all expressions are false option is enabled. Use the expression in this label when all other Case expressions evaluate to false.
• True: if the Row can be TRUE for one case only option is enabled, the row is passed to the first case whose expression returns true. Otherwise, the row is passed to all the cases whose expression returns true.


Data Output

The connections between the Case transform and objects used for a particular case must be labeled. Each output label in the Case transform must be used at least once.

You connect the output of the Case transform with another object in the workspace. Each label represents a case expression (WHERE clause).
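Conceptually, the Case transform routes each row to the outputs whose label expressions evaluate to true. The sketch below shows that routing in plain Python; the REGION_ID column and the R1/R2/R3 labels are borrowed from the activity that follows, and the `break` mirrors the Row can be TRUE for one case only option.

```python
# One input, several labeled outputs; each label has a WHERE-style Case expression.
cases = {
    "R1": lambda row: row["REGION_ID"] == 1,
    "R2": lambda row: row["REGION_ID"] == 2,
    "R3": lambda row: row["REGION_ID"] == 3,
}

source = [{"CUST_ID": 10, "REGION_ID": 1}, {"CUST_ID": 11, "REGION_ID": 3}]
outputs = {label: [] for label in cases}

for row in source:
    for label, expression in cases.items():
        if expression(row):
            outputs[label].append(row)
            break  # "Row can be TRUE for one case only"; without it, the row
                   # would go to every case whose expression returns true

print(outputs["R1"], outputs["R3"])
```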


Activity: Using the Case transform

Objectives

In this activity you will define a data flow to load source data from the ODS.CUSTOMER table into regional customer data marts. Based on identified business requirements, the regional data marts are separated into Region1, Region2, and Region3.

You will use the Case transform to load data from ODS.CUSTOMER table with REGION_ID values of 1, 2, and 3 into the specified regional customer data marts.

Instructions

1 Create a new job. Name the job Case_Job.
2 Create a new data flow. Name the data flow Case_DF.
3 Select the ODS_Customer table as a source.
4 Create three target template tables:

5 Add the Case transform from your list of transforms, and connect the source table to the Case transform.

6 Double-click the Case transform.
7 Clear these options:
   • Produce default output with label
     Note: When you clear this option, the deselected option becomes Produce default output when all expressions are false. Leave this option unchecked.
   • Row can be TRUE for one case only
8 In the Case Editor, click Add to add an expression.
9 Change the Label value Case_out_1 to R1.

Practice


You should see a red X icon appear beside R1.

10 From the source schema in the Case Editor, drag the column REGION_ID to the Expression Editor and type =1 beside it. The expression you just typed should populate this expression field.

Note: You may need to click outside of the Expression Editor white space to populate the values you just created in the Expression column.

11 Add these remaining expressions to the Case Editor:

   Label   Expression
   R2      ODS_CUSTOMER.REGION_ID = 2
   R3      ODS_CUSTOMER.REGION_ID = 3

12 Go back to the data flow level, and connect the Case transform to the R1 target table you created.
13 Connect the Case transform to the other target tables with the respective labels R2 and R3.
   Your data flow should look like this:


14 Execute the job.
   Note: You must drag the job from the Local object library into a project folder in the Designer project area before executing the job.
15 Go back to the data flow level after the job execution completes successfully.
16 Click the magnifying glass for the R1 target table.

This table should only contain customers from Region 1:

17 Click the magnifying glass for the R2 target table. This table should display customers from Region 2:

18 Click the magnifying glass for the R3 target table. This table should display similar results to the R1 and R2 target tables.


Merge transform

The Merge transform combines incoming data sets, producing a single output data set with the same schema as the input data sets.

For example, you may need to merge the sales data of two company divisions into one.

Data Input

The Merge transform performs a union of the sources. All sources must have the same schema as shown in the diagram below, including:
• The same number of columns
• The same column names
• The same data types of columns

If the input data set contains hierarchical data, the names and data types must match at every level of the hierarchy.

Data Output

The output data has the same schema as the source data. The output data set contains a row for every row in the source data sets. The transform does not strip out duplicate rows. If columns in the input set contain nested schemas, the nested data is passed through without change.Note: The Merge transform does not offer any options.

Source 1:

Name      Address
Joe       11 Crazy Street
Sid       13 Deadhead Street
Charlie   15 Yukon Lane
Dolly     Goldrush Saloon

Source 2:

Name      Address
Joe       Halfway House
Mollie    7 O'Brian Street
Charlie   Halfway House
Daisy     Goodtime Tavern

Output:

Name      Address
Joe       11 Crazy Street
Sid       13 Deadhead Street
Charlie   15 Yukon Lane
Dolly     Goldrush Saloon
Joe       Halfway House
Mollie    7 O'Brian Street
Charlie   Halfway House
Daisy     Goodtime Tavern
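In other words, the Merge transform behaves like a UNION ALL of identically structured inputs: all rows are kept, including duplicates. A minimal sketch of that behavior in plain Python, using the Name/Address rows above:

```python
source_1 = [("Joe", "11 Crazy Street"), ("Sid", "13 Deadhead Street"),
            ("Charlie", "15 Yukon Lane"), ("Dolly", "Goldrush Saloon")]
source_2 = [("Joe", "Halfway House"), ("Mollie", "7 O'Brian Street"),
            ("Charlie", "Halfway House"), ("Daisy", "Goodtime Tavern")]

# Merge = union of sources with identical schemas; duplicate rows are NOT stripped out.
output = source_1 + source_2

for name, address in output:
    print(name, address)   # eight rows, same two columns as the inputs
```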


Activity: Using the Merge transform

Objectives

In this activity you will:
• Use the Merge transform to read multiple flat files that have been given to you and load this source data into a template table.
  Note: The files given to you also have identical formats.
• Create one single file format to read all these files at once.

Instructions

1 Create a new job and call it Orders_Flatfile2_Job.
2 Create a new data flow and call it Orders_Flatfile2_DF.
3 In the Local object library, under the Formats tab, right-click Orders_Flatfile, and select Replicate.
4 Rename this file to Orders_FlatFile2.
5 Under Data File, in the File name(s) field, delete *.txt.
6 Click Save & Close.

Note: If you get a message notifying you to save changes to the schema, click Yes to save the changes to the schema, and then click Save & Close again.

7 Open Orders_Flatfile2_DF.
8 Drag the Orders_FlatFile2 file format onto the workspace three separate times and make each one a source.
9 In the workspace, double-click the first Orders_FlatFile2 format file.
10 In the File Format Editor, under Data File, in the File name(s) field, click the folder icon.
11 Verify that the Folder location also points to the folder in which the orders011197.txt file is located.
12 Browse to the orders011197.txt file and open it.
13 Double-click the remaining two Orders_FlatFile2 format files and browse to the remaining orders files, respectively:
   • orders071196.txt
   • orders071197.txt
14 Add the Merge transform to the workspace.
15 Add a template table to the workspace and call it ORDERSINFO_2.
16 Connect the three flat files, the Merge transform, and the ORDERSINFO_2 template table.

Practice


Your Orders_FlatFile2_DF should look like this:

17 Validate, save, and execute the job.
   Note: You must drag the job from the Local object library into a project folder in the Designer project area before executing the job. Data conversion warning messages are OK. As in the previous Reading multiple file formats activity, you should receive errors, even though the job should execute successfully.

The total number of rows in ORDERSINFO_2 should be 217.


Validation transform

Data validation addresses the need to deliver trusted information through a productivity-enhancing process that ensures the accuracy of your data. A common challenge for ETL developers is exception handling: out-of-range data, fields with NULL values, or incorrect data.

The Validation transform allows you to define a reusable business rule to validate each record and column. The Validation transform qualifies a data set based on rules for input schema columns. It filters out or replaces data that fails your criteria. The available outputs are pass and fail. You can have one validation rule per column.

For example, if you want to load only sales records for the month of October 2004, you may want to set up a validation rule that states: Sales Date is between 10/1/04 and 10/31/04. Data Integrator will look at this date field in each record to validate if the data meets this requirement. If it does not, you can choose to pass the record into a Fail table, correct it, or do both.

Use the Validation transform in your data flows when you want to ensure that the data at any stage in the data flow meets your criteria.

For example, you can set the transform to ensure that all:
• Values in a particular column are greater than $1,000
• Values in a column of phone numbers have the same format


When you use the Validation transform you select a column in the input schema and create a validation rule in the Validation transform editor:

When you create a validation rule you can input only one source in a data flow.

Your validation rule consists of a condition and an action on failure:
• Use the condition to describe what you want for your valid data.
  For example, specify the condition IS NOT NULL if you do not want any NULLs in data passed to the specified target.
• Use the Action on Failure area to describe what happens to invalid or failed data. Continuing the example above, for any NULL values, you may want to select the Send to Fail option to send all NULL values to a specified FAILED target table.

For failed column values you can specify:
• Where to send a row of data when a value in a column fails to meet the condition specified in the rule:
  • Send to Fail
  • Send to Pass
  • Send to both
• Optionally, what value to insert as a substitute for a failed value, with the For Pass, Substitute with option.
  Note: This option only applies if:
  • The column value failed the validation rule
  • The Send to Pass or Send to both option is selected


You can also create a custom Validation function and select it when you create a validation rule. For more information on creating custom Validation functions, see "Validation Transform", Chapter 12 in the Data Integrator Reference Guide. See the Data Integrator Release Summary if you are using the January 2004 Data Integrator release build.

Data Input

One source in a data flow.

Options

The Validation transform editor offers several options:
• Enable Validation: this option turns the validation rule on and off.
• Do not validate when NULL: select this option if you want Data Integrator to send all NULL values to the Pass output. Data Integrator will not apply the validation rule on this column when an incoming value for it is NULL. Data Integrator assumes that the validation rule has succeeded.
• Condition. The following conditions are available:
  • In: this condition allows you to specify a list of values that a column can contain.
  • Between: this condition allows you to specify a range of values for a column.
  • Custom Validation function: this condition allows you to select a function from a list for validation purposes. Data Integrator only supports Validation functions that take one parameter and return an integer data type. If a return value is not zero, then Data Integrator processes it as TRUE.
    Note: Only custom functions can be made Validation functions. To set the Designer to display such functions, select the Validation function check box on a function's Properties window when you create a new custom function.
  • Exists in table: this option allows you to specify that a column's value must exist in another table's column. This option also uses the LOOKUP_EXT function. You can define the NOT NULL constraint for the column in the LOOKUP table to ensure the Exists in table condition executes properly.
  • Custom condition: this condition allows you to create more complex expressions using buttons that open the function and smart editors.
• Action on Fail: you define the action on a failure to your validation rule using one of the actions listed. These actions specify what Data Integrator will do with a row of data when a column value fails to meet a condition. You can send the row to the Fail target, to the Pass target, or to both. If you choose Send to Pass or Send to Both, you can substitute a value or expression for the failed values that are sent to the Pass output, using the smart editor provided. For example, you might choose to insert the correct value.
  On submit, Data Integrator converts substitute values to the corresponding column data type: integer, decimal, varchar, date, datetime, timestamp, or time.


The Validation transform requires that you enter some values in specific formats:

   date        YYYY.MM.DD
   datetime    YYYY.MM.DD HH24:MI:SS
   time        HH24:MI:SS
   timestamp   YYYY.MM.DD HH24:MI:SS.FF

If, for example, you specify a date as 12-01-2004, Data Integrator produces an error because you must enter this date as 2004.12.01.

Data Output

The Validation transform outputs up to two different data sets based on the validation condition you specify.

You can load pass and fail data into multiple targets. The Pass output schema is identical to the input schema. Connect the output of the Validation transform with one (Pass or Fail) or two (for example, Pass and Fail) objects in the workspace. Select a Validation pass or fail label from a pop-up menu when you connect these:

Data Integrator adds the following two columns to the Fail output schemas:
• The DI_ERRORACTION column indicates where failed data was sent in this way:
  • The letter B is used for data sent to both the Pass and Fail outputs.
  • The letter F is used for data sent only to the Fail output.
  Note: If you choose to send failed data to the Pass output, Data Integrator does not track the results. You may want to substitute a value for failed data that you send to the Pass output because Data Integrator does not add columns to the Pass output; the input schema is maintained in the Pass output.
• The DI_ERRORCOLUMNS column displays all error messages for columns with failed rules. The names of input columns associated with each message are separated by colons. For example, "<ValidationTransformName> failed rule(s): c1:c2".
  Note: If a row has conditions set for multiple columns and the Pass, Fail, and Both actions are specified for the row, then the precedence order is Fail, Both, Pass. For example, if one column's action is Send to Fail and the column fails, then the whole row is sent only to the Fail output. Other actions for other validation columns in the row are ignored.
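The pass/fail routing and substitution described above can be sketched as follows. This is plain Python, not the transform itself; the FAX column, the rule name, and the sample rows are illustrative assumptions, and the DI_ERRORACTION/DI_ERRORCOLUMNS values are filled in only in the spirit of the description above.

```python
def validate(rows, column="FAX", send_to_both=True, substitute="UNKNOWN"):
    """Sketch of one validation rule: condition IS NOT NULL on a single column."""
    pass_output, fail_output = [], []
    for row in rows:
        if row[column] is not None:           # condition holds: row passes unchanged
            pass_output.append(dict(row))
            continue
        failed = dict(row)
        failed["DI_ERRORACTION"] = "B" if send_to_both else "F"
        failed["DI_ERRORCOLUMNS"] = f"Validation_1 failed rule(s): {column}"
        fail_output.append(failed)
        if send_to_both:                       # Send to both: also pass, substituted
            fixed = dict(row)
            fixed[column] = substitute         # the "For Pass, Substitute with" value
            pass_output.append(fixed)
    return pass_output, fail_output

passed, failed = validate([{"NAME": "Alfreds", "FAX": None},
                           {"NAME": "Bon app", "FAX": "05.34.94"}])
```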



To create a validation rule

1 Drag your source table and Validation transform onto the workspace.
2 Connect the input source table to the Validation transform.
3 Double-click the Validation transform to open the editor.
4 Click to highlight an input schema column, and select the Enable Validation check box.
5 Under Conditions, select a condition.
   Note: All conditions must be Boolean expressions. You can also create custom conditions using Data Integrator built-in functions. Built-in functions are discussed in Lesson 7, "Using built-in functions".
6 Under Action on Failure, select an action.
   Note: If you choose Send to Pass or Send to Both, you can choose to substitute a value or expression for a failed value that is sent to the Pass output.


Activity: Using the Validation transform

Objectives

In this activity, we will move all customers from a source table to template tables. Based on the identified business rules, we will use the Validation transform to create two target template tables.

We want to:
• Move all 91 customers to one target template table and replace NULL values in the FAX column with UNKNOWN
• Separate and move only those customers who have NULL values for the FAX column into a second target table

Instructions

1 Create a datastore called Northwind that connects to the Northwind sample database that comes with MS SQL Server and import the metadata.
   Note: Use the default sa for both the SQL user name and password.
2 Create a job called Validation_Job.
3 Create a data flow called Validation_DF.
4 Add the Customers table from the Northwind datastore as a source table to Validation_DF.
5 Click the magnifying glass on the CUSTOMERS table to observe the data in the table. You will notice that some of the customers' fax numbers are NULL:
6 Go back to the Validation_DF workspace.
7 From the Designer, under the Datastores tab, expand the Northwind datastore, and drag the Template table onto the workspace to create two template tables.

Practice


Name one template table FAX and the other one HAS_NO_FAX:

Note: We will use these template tables to copy all 91 customers to the FAX table and rename NULL values in the Customers table to UNKNOWN in the FAX table. We will also copy only those customers who have a NULL for their fax number into the /Validation_DF/Validation_1_Fail_HAS_NO_FAX table.

8 Add the Validation transform to the workspace, and connect the CUSTOMERS source table to the Validation transform.

9 Double-click the Validation transform to open the Editor.

Using the Fax column from your input schema, create a Validation rule with the Condition IS NOT NULL:

10 To see the data for both target tables, select Send to both in the Action on Failure option for your validation rule.


11 For any rows that Pass the criteria, substitute the value NULL with UNKNOWN. Type ‘UNKNOWN’ in the editor box:

12 Go back to the Validation_DF level, and connect the Validation transform to the target tables:
   • Select Pass for the FAX target table.
   • Select Fail for the HAS_NO_FAX target table.
13 Associate the Validation_Job to the Exercises project and execute the job.
   You should see this message if your job executes successfully:
   On the monitor tab you can see that all the customers from the source table '/Validation_DF/CUSTOMERS' are moved to the template table '/Validation_DF/Validation_1_Pass_FAX'. The original CUSTOMERS source table has 91 customers. All 91 customers have been moved to the FAX table, but the NULL fax values are replaced with UNKNOWN. You also sent those customers with a NULL for their fax number into the HAS_NO_FAX table. The total number of customers in the HAS_NO_FAX table is 22.
14 Go back to the Validation_DF level, and click the magnifying glass on the CUSTOMERS and FAX tables. You should see that the NULL values in the source table now appear as the substituted value UNKNOWN in the FAX table:
15 Close the CUSTOMERS and FAX tables.
16 View data for the HAS_NO_FAX table.
   You should see two extra columns:
   • DI_ERRORACTION indicates the error actions.
   • DI_ERRORCOLUMNS indicates the validation rules which failed.


Date_Generation transform

The Date_Generation transform is ideal for creating time dimension tables. It produces a series of dates incremented as specified by you.

You use this transform to produce the key values for a time dimension target. From this generated sequence you can populate other fields in the time dimension (such as day_of_week) using functions in a query. There are no data inputs for this transform.

Options

The Date_Generation transform offers several options:
• Start date: The first date in the output sequence. You specify this date using the YYYY.MM.DD format.
• End date: The last date in the output sequence. Use the same format used for Start date to specify this date.
• Increment: The interval between dates in the output sequence. Select Daily, Monthly, or Weekly.
• Join rank: A positive integer indicating the weight of the output data set if the data set is used in a join. Sources in the join are accessed in order based on their join ranks. The highest ranked source is accessed first to construct the join.
• Cache: Used to hold the output from the transform in memory so it may be used in subsequent transforms.

Data Output

A data set with a single column named DI_GENERATED_DATE containing the date sequence. The rows generated are flagged as INSERT.

The Date_Generation transform does not generate hierarchical data. Generated dates can range from 1900.01.01 through 9999.12.31.
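As a conceptual sketch, the following plain-Python function produces the same kind of single-column date series; the start and end dates are arbitrary, and only the Daily and Weekly increments are shown.

```python
from datetime import date, timedelta

def date_generation(start="2004.01.01", end="2004.12.31", increment="Daily"):
    """Sketch of a Date_Generation-style series: one DI_GENERATED_DATE column."""
    step = {"Daily": 1, "Weekly": 7}.get(increment, 1)   # Monthly omitted for brevity
    current = date(*map(int, start.split(".")))          # parse YYYY.MM.DD
    stop = date(*map(int, end.split(".")))
    rows = []
    while current <= stop:
        rows.append({"DI_GENERATED_DATE": current})       # each row is flagged INSERT
        current += timedelta(days=step)
    return rows

time_dim_keys = date_generation()   # 366 daily rows for 2004 (a leap year)
```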


Understanding nested data

Real-world data often has hierarchical relationships that are represented in a relational database with master-detail schemas using foreign keys to create the mapping. However, some data sets, such as XML documents, handle hierarchical relationships through nested data.

After completing this unit, you will be able to:
• Understand hierarchical data representation
• Explain what nested data is
• Import metadata from XML documents

You can represent the same hierarchical data in several ways:
• Multiple rows in a single data set. For example:
• Multiple data sets related by a join. For example:

Introduction

Understanding hierarchical data representation


• Nested data. For example:

Sales orders are often presented using nested data. For example, the line items in a sales order are related to a single header and are represented using a nested schema. Each row of the sales order data set contains a nested line item schema as shown:

Using the nested data method can be more concise (no repeated information), and can scale to present a deeper level of hierarchical complexity.
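In plain data-structure terms, a nested sales order is a row whose line-item column is itself a small table. The sketch below uses hypothetical column names to contrast the nested form with the flat, repeated-header form:

```python
# One order row; the LINE_ITEMS column contains a nested schema (a list of item rows).
order = {
    "ORDER_NO": 9999,
    "CUST_ID": "ALFKI",
    "LINE_ITEMS": [
        {"ITEM": 1, "MTRL_ID": "M100", "QTY": 2},
        {"ITEM": 2, "MTRL_ID": "M205", "QTY": 1},
    ],
}

# The flat (multiple-rows) representation repeats the header values on every line item.
flat = [{"ORDER_NO": order["ORDER_NO"], "CUST_ID": order["CUST_ID"], **item}
        for item in order["LINE_ITEMS"]]
```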

Explaining what nested data is


To expand on the example above, columns inside a nested schema can also contain columns.

There is a unique instance of each nested schema for each row at each level of the relationship as shown:

Generalizing further with nested data, each row at each level can have any number of columns containing nested schemas.

Data Integrator maps nested data to a separate schema implicitly related to a single row and column of the parent schema. This mechanism is called Nested Relational Data Modelling (NRDM). NRDM provides a way to view and manipulate hierarchical relationships within data flow sources, targets, and transforms.

In Data Integrator you can see the structure of nested data in the input and output schemas of sources, targets, and transforms in data flows.


Data Integrator allows you to import and export metadata for XML documents that you can use as sources or targets in jobs.

XML documents are hierarchical and their valid structure is stored in separate format documents. The format of an XML file or message (.xml) can be specified using either a document type definition (.dtd) or an XML Schema (for example, an .xsd file).

A DTD format or XML Schema describes the data schema of an XML message or file. Data flows can read and write data to messages or files based on a specified DTD format or XML Schema. You can use the same DTD format or XML Schema to describe multiple XML sources or targets.
Note: Data Integrator supports the W3C XML Schema Specification 1.0.

Data Integrator uses Nested Relational Data Modeling (NRDM) to structure imported metadata from format documents, such as .xsd or .dtd files, into an internal schema to use for hierarchical documents.

Importing metadata from a DTD file

For an XML document that contains, for example, information to place a sales order, such as order header, customer, and line items, the corresponding DTD includes the order structure and the relationship between the data as shown:

You can import metadata from either an existing XML file (with a reference to a DTD) or a DTD file. If you import the metadata from an XML file, Data Integrator automatically retrieves the DTD for that XML file.

When importing a DTD format, Data Integrator reads the defined elements and attributes, and ignores other parts, such as text and comments, from the file definition. This allows you to modify imported XML data and edit the data type as needed.

Importing data from XML documents


To import a DTD format

1 From the Local object library, click the Formats tab.
2 Right-click DTDs, and select New.
   The Import DTD Format editor displays:
3 In the DTD definition name field, enter the name you want to give the imported DTD format.
4 Beside the File name field, click Browse, and locate the file path that specifies the DTD you want to import.
5 In the File type area, select a file type. The default file type is DTD.
   Note: Use the XML option if the DTD file is embedded within the XML data.
6 In the Root element name field, select the name of the primary node of the XML that the DTD format is defining.
   Note: Data Integrator only imports elements of the format that belong to this node or any sub nodes. This option is not available when you select the XML file option type.
7 In the Circular level field, specify the number of levels in the DTD, if applicable.
   Note: If the DTD format contains recursive elements, for example, element A contains B and element B contains A, this value must match the number of recursive levels in the DTD format's content. Otherwise, the job that uses this DTD format will fail.
8 In the Default varchar size field, set the varchar size used to import strings into Data Integrator. The default varchar size is 1024.

9 Click OK.

After you import the DTD format, you can view the DTD format’s column properties, and edit the nested table and column attributes in the DTD -XML Format editor. For more information on DTD attributes, see “DTD” in “Data Integrator Objects”, Chapter 2 in the Data Integrator Reference Guide.


To view and edit column properties and column attributes of nested schemas

1 In the Formats tab, expand DTDs, and double-click the DTD name.
2 In the DTD - XML Format editor, double-click a nested column or column, and select Attributes.
3 In the Column Attributes window, click Attributes, and click the attribute name.
4 In the Value field, edit the attribute.


Importing metadata from an XML schema

For an XML document that contains, for example, information to place a sales order, such as order header, customer, and line items, the corresponding XML schema includes the order structure and the relationship between the data as shown:

When importing an XML Schema, Data Integrator reads the defined elements and attributes, and imports:
• Document structure
• Table and column names
• Data type of each column
• Nested table and column attributes
Note: While XML Schemas make a distinction between elements and attributes, Data Integrator imports and converts them all to nested table and column attributes. For more information on Data Integrator attributes, see "XML Schema" in "Data Integrator Objects", Chapter 2 in the Data Integrator Reference Guide.


To import an XML schema

1 From the Local object library, click the Formats tab.
2 Right-click XML Schemas, and select New.
   The Import XML Schema Format editor displays:
3 In the Format name field, enter the name you want to use for the format.
4 In the File name/URL field, enter the file name of the XML Schema or its URL address.
5 In the Root element name field, select the name of the primary node you want to import.
   Note: Data Integrator only imports elements of the XML Schema that belong to this node or any subnodes. If the root element name is not unique within the XML Schema, select a namespace to identify the imported XML Schema.
6 In the Circular level field, specify the number of levels the XML Schema has, if applicable.
   Note: If the XML Schema contains recursive elements, for example, element A contains B and element B contains A, this value must match the number of recursive levels in the XML Schema's content. Otherwise, the job that uses this XML Schema will fail.
7 In the Default varchar size field, set the varchar size used to import strings into Data Integrator. The default varchar size is 1024.

8 Click OK.

After you import an XML Schema, you can view the XML schema’s column properties, and edit the nested table and column attributes in the XML Schema Format editor.


To view and edit column properties and column attributes of nested schemas

1 In the Formats tab, double-click the XML Schema name.
2 In the XML Schema Format editor, double-click a nested column or column, and select Attributes.
   Note: For more information on data types that Data Integrator uses when it imports the XML document metadata, see "XML Schema" in "Data Integrator Objects", Chapter 2 in the Data Integrator Reference Guide.


Understanding operations on nested data

Nested data included in the input of a transform, with the exception of the Query transform, passes through the transform without being included in the transform’s operation.

With nested data, only the columns at the first level of the input data set are available for transformation. Therefore, to transform values in nested schemas for input in relational tables, you must use the Query transform to unnest the data, perform the transformation, then load the data into a target relational table. Note: After unnesting the data you can construct the nested schema again to

load into a nested schema target. For the purpose of this course, we focus only on loading XML data into a relational target. For more information, see “To construct a nested data set” under “Nesting columns”, Chapter 9 “Nested data” in the Data Integrator Designer Guide.

After completing this unit, you will be able to:
• Explain uses of nested data and the Query transform
• Unnest data
• Use the XML_Pipeline in a data flow

With relational data, the Query transform allows you to execute a SELECT statement, where the mapping between input and output schemas in the Query transform editor defines the projection list for the SELECT statement. The Query transform assumes that the FROM clause in the SELECT statement contains the data sets that are connected as inputs to the Query object.

When you work with nested data, the Query transform provides an interface to perform SELECT statements at each level of the relationship that you define in the output schema, but you must explicitly define the FROM clause in the Query editor. Note: Data Integrator assists by setting the top level inputs as the default

FROM clause values for the top-level output schema.

The remaining SELECT statement elements defined by the Query transform work the same with nested data as they do with flat data. However, because a SELECT statement can only include references to relational data sets, a Query transform that has nested data includes a SELECT statement to define operations for each parent and child schema in the output.

Introduction

Explaining uses of nested data and the Query transform


The parameters you enter apply only to the current schema, as displayed in the Output list in the diagram below:

FROM clause construction

When you include a schema in the FROM clause, you indicate that all of the columns in the schema, including columns containing nested schemas, are available to be included in the output.

If you include more than one schema in the FROM clause, you indicate that the output will be formed from the cross product of the two schemas, constrained by the WHERE clause for the current schema.

These FROM clause descriptions and the behavior of the query are exactly the same with nested data as with relational data.
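As a sketch of that behavior, listing two schemas in the FROM clause and constraining them with a WHERE clause is equivalent to a filtered cross product. The column names below are invented for illustration only:

```python
orders = [{"ORDER_NO": 1, "CUST_ID": "A"}, {"ORDER_NO": 2, "CUST_ID": "B"}]
customers = [{"CUST_ID": "A", "NAME": "Ann"}, {"CUST_ID": "B", "NAME": "Bob"}]

# FROM orders, customers WHERE orders.CUST_ID = customers.CUST_ID
joined = [{**o, **c}
          for o in orders                        # cross product of the two schemas...
          for c in customers
          if o["CUST_ID"] == c["CUST_ID"]]       # ...constrained by the WHERE clause
```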

The current schema allows you to distinguish multiple SELECT statements from each other within a single query. However, determining the appropriate FROM clauses for multiple levels of nesting can be complex because the SELECT statements are dependent upon each other.

A FROM clause in a nested schema can contain:
• Any top-level schema from the input
• Any schema that is a column of a schema in the FROM clause of the parent schema


The FROM clause forms a path that can start at any level of the output, but the first schema in the path must always be a top-level schema from the input:

Note: The data that a SELECT statement from a lower schema produces differs depending on whether or not a schema is included in the FROM clause at the top-level.

The next two examples use the sales order data set to illustrate scenarios where FROM clause values change the data resulting from the query.

FROM clause that includes all top-level inputs

Joining the order schema at the top level with a customer schema, the output can include detailed customer information for all of the orders in the data set. Including both input schemas in the FROM clause at the top level produces the appropriate data.


Lower level FROM clause that contains top-level input

If instead the input includes a material schema, you would define the join constraint for the line-item schema so that the detailed material information appears for each row in the nested data set. The FROM clause for the line-item schema would include the material schema (top level) and the line-item schema.

For the line-item schema to be available in the FROM clause, the order schema would have to be included in the FROM clause of the top-level schema.


Loading a data set that contains nested schemas into a relational target requires that the nested rows be unnested.

For example, a sales order may use a nested schema to define the relationship between the order header and the order line items. To load the data into relational schemas, the multi-level schema must be unnested.

Unnesting a schema produces a cross-product of the top-level schema (parent) and the nested schema (child).

You can also load different columns from different nesting levels into different schemas. For example, a sales order can be flattened so that the order number is maintained separately with each line-item and the header and line-item information are loaded into separate schemas.

Data Integrator allows you to unnest any number of nested schemas at any depth. No matter how many levels are involved, the result of unnesting schemas is a cross product of the parent and child schemas.

When more than one level of unnesting occurs, the inner-most child is unnested first; then the result (the cross product of the parent and the inner-most child) is unnested from its parent, and so on up to the top-level schema.

Unnesting data

Keep in mind that unnesting all schemas to create a cross product of all data might not produce the results you intend. For example, if an order includes multiple customer values such as ship-to and bill-to addresses, flattening a sales order by unnesting customer and line-item schemas produces rows of data that might not be useful for processing the order.
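A minimal sketch of unnesting one level, assuming a hypothetical order structure with a nested line-item column: each parent row is repeated once for each of its own nested rows (the cross product described above). For deeper nesting, the same step is applied starting from the inner-most child.

```python
def unnest(parent_rows, nested_column):
    """Flatten one nested level: each parent row is repeated once per nested row."""
    flat = []
    for parent in parent_rows:
        children = parent.get(nested_column, [])
        header = {k: v for k, v in parent.items() if k != nested_column}
        for child in children:
            flat.append({**header, **child})   # cross product of parent and child
    return flat

orders = [{"ORDER_NO": 1, "LINE_ITEMS": [{"ITEM": 1}, {"ITEM": 2}]}]
print(unnest(orders, "LINE_ITEMS"))
# [{'ORDER_NO': 1, 'ITEM': 1}, {'ORDER_NO': 1, 'ITEM': 2}]
```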

To unnest data

1 After creating the appropriate data flow, double-click the Query transform to open the Query editor.
   Note: This assumes the data flow contains a nested data source, a Query transform, and a relational target table.
2 In the Query editor, drag the required nested schema from the Schema In pane to the Schema Out pane.
3 In the Schema Out pane, select a top-level schema, and right-click.
4 From the drop-down menu, select Make Current to activate all columns contained within the nested schema.
5 Highlight the top-level schema again, right-click, and select Unnest.


The output of the query (the input to the next step in the data flow) includes the data in the new relationship.

Note: Data for columns or schemas that you do not need might be more difficult to filter out after the unnesting operation. Thus, you can use the Cut command to remove columns or schemas from the top level. To remove nested schemas or columns inside nested schemas, make the nested schema the current schema, and then cut the unneeded columns or nested columns.


Activity: Populating the Material dimension from an XML file

Objectives

In this activity you will:
• Import a document type definition (DTD) file to define and extract data from an XML file.
• Create a job and define the data flow using the Query transform to unnest (flatten) selected elements from it to populate the denormalized material dimension table.

Instructions

Importing the DTD file

1 In the Designer Local object library, click the Formats tab.
2 Right-click DTDs, and select New.
3 In the DTD definition name field, type Mtrl_List.
4 Beside the File name field, click Browse to navigate to the mtrl.dtd file in your Data Integrator Program Files directory at \Tutorial Files\mtrl.dtd, and open it.
5 Leave the default DTD file option.
6 In the Root element name list, select MTRL_MASTER_LIST.
7 Leave the default circular level and default varchar size values.
8 Click OK.

Creating the job and the data flow

1 In the Designer tab of the Project area, under the Exercises folder, create a job called MtrlDim_Job.
2 Add a work flow to MtrlDim_Job called MtrlDim_WF.
3 Add a data flow to MtrlDim_WF called MtrlDim_DF.
4 Open MtrlDim_DF in the workspace.
5 From the Local object library, click the Formats tab, and expand DTDs.
6 Drag the Mtrl_List DTD file into the data flow workspace, and select Make XML File Source.
7 Double-click Mtrl_List to open it.
8 On the Source tab for the XML file, click Browse to import the mtrl.xml file in your Data Integrator Program Files directory at \Tutorial Files\mtrl.xml.
9 Open mtrl.xml.
10 Select Enable Validation.

Note: Enabling validation turns on the comparison of the incoming data to the stored XML Schema or DTD format.

11 In the Designer tab of Project area, click MtrlDim_DF to return to the data flow level.

Practice


12 Add a Query transform to the data flow, and name it Qryunnest.
13 From the Local object library, click the Datastores tab, and expand Target_DS.
14 Expand Tables, add MTRL_DIM to the right of Qryunnest, and select Make Target.
15 Connect the source, Query transform, and target.

Defining Qryunnest

1 Double-click Qryunnest to open it.
   Notice the nested structure of the source in the Schema In pane.
2 Try dragging an individual column across to one of the columns in the Schema Out pane. Note that you cannot map to a higher level in the hierarchy.
3 Look at the differences in column names and data types between the input and output schemas.
   Tip: Use the scroll bar to view data types.
4 Look at the Schema In and Schema Out panes.
   Notice that the Schema In pane contains more columns than required by the desired Schema Out pane, and the columns that match both panes (for example, MTRL_ID) require flattening to fit the target table.
5 In the Schema Out pane, click MTRL_ID.
6 Press SHIFT, and click DESCR.
   This should highlight all the columns in the Schema Out pane.
7 Right-click the highlighted columns, and select Cut.
   You are cutting, not deleting, these columns in order to capture the correct column names and data types from the target schema. You will later paste these columns into the appropriate nested context from the source. This way you avoid having to edit the source column names and data types.
8 From the Schema In pane, click the MTRL_MASTER schema, and drag it into the Schema Out pane.
   The MTRL_MASTER schema should now display in the Schema Out pane.

9 In the Schema Out pane, right-click MTRL_MASTER, and select Make Current to make the schema available for editing.

10 Click MTRL_ID, select the remaining columns down to and including the HAZMAT_IND nested schema, right-click, and select Delete.

11 Right-click MTRL_MASTER, and select Paste, to paste back the columns you cut from the target table earlier.


The Schema Out pane should look like this:

12 Remap these columns from the Schema In pane to the Schema Out pane:

   MTRL_ID      MTRL_ID
   MTRL_TYPE    MTRL_TYP
   IND_SECTOR   IND_SECTOR
   MTRL_GROUP   MTRL_GRP

13 The DESCR target column needs to be mapped to SHORT_TEXT, which is located in a nested schema:
   • In the Schema Out pane, right-click DESCR, and select Cut to capture the DESCR column name and data type.
   • In the Schema Out pane, right-click the TEXT schema, and select Make Current.
   • In the Schema Out pane, under the TEXT schema, right-click LANGUAGE.
   • Select Paste, and select Insert Below to place the DESCR column at the same level as the SHORT_TEXT column in the Schema In pane.
   • From the Schema In pane, remap the SHORT_TEXT column to the DESCR column in the Schema Out pane.
   The resulting Schema Out pane should look like this:
14 In the Schema Out pane, highlight LANGUAGE, SHORT_TEXT, and TEXT_nt_1, and select Delete.



15 In the Designer tab of the Project area, click MTRL_DIM to view its schema.
   Notice that there are two nested levels (MTRL_MASTER and TEXT) in the Schema In pane; thus, you will not be able to produce the flat schema required by the target.
16 In the Designer tab of the Project area, click Qryunnest again to view its schema.
17 In the Schema Out pane, right-click TEXT, and select Unnest.
   Note: An arrow appears on the TEXT schema.
18 In the Designer tab of the Project area, click MTRL_DIM to open it.
   Notice that now you have one nested level (MTRL_MASTER), and are still not able to produce the flat schema required by the target.
19 In the Designer tab of the Project area, click Qryunnest.
20 In the Schema Out pane, right-click MTRL_MASTER, and select Make Current.
21 Right-click MTRL_MASTER again, and select Unnest.
   The columns in the Qryunnest editor Schema Out pane should look like this:
22 In the Designer tab of the Project area, click MTRL_DIM to view the schema again.
   Both the Schema In and Schema Out panes are now flat.
23 From the Project menu, click Save All.
24 Go back to the MtrlDim_DF level and validate the data flow.
25 Execute the job.

The job should execute successfully without errors.


XML_Pipeline

The XML_Pipeline is used to process large XML files one instance of a specified repeatable structure at a time. With this transform, Data Integrator does not need to read the entire XML input into memory and build an internal data structure before performing the transformation.

This means that an NRDM structure is not required to represent the entire XML data input. Instead, this transform uses a portion of memory to process each instance of a repeatable structure, then continually releases and reuses the memory to continuously flow XML data through the transform.

During execution, Data Integrator pushes operations of the streaming transform to the XML source. Therefore, you cannot use a breakpoint between your XML source and an XML_Pipeline.
Note: You can use the XML_Pipeline to load into a relational or nested schema target. This course focuses on loading XML data into a relational target. For more information on constructing nested schemas for your target, see "To construct a nested schema" under "Nesting columns", Chapter 9 "Operations on nested data" in the Data Integrator Designer Guide.
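The streaming idea is comparable to parsing an XML document one repeatable element at a time instead of building the whole document in memory, as in the following sketch using Python's standard library. The purchaseOrder element name and file path are assumptions borrowed from the later activity; this is an analogy, not how Data Integrator implements the transform.

```python
import xml.etree.ElementTree as ET

def stream_purchase_orders(path="c:/temp/pos.xml"):
    """Process one repeatable instance at a time, releasing memory as we go."""
    for event, element in ET.iterparse(path, events=("end",)):
        # strip any namespace prefix and keep only the repeatable structure
        if element.tag.split("}")[-1] == "purchaseOrder":
            yield {child.tag.split("}")[-1]: child.text for child in element}
            element.clear()   # free the parsed subtree instead of keeping it

for row in stream_purchase_orders():
    pass  # each row is produced as soon as one purchaseOrder element is parsed
```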

Data input

You can use an XML file or XML message. You can also connect more than one XML_Pipeline to an XML source.

Options

The XML_Pipeline is streamlined to support massive throughput of XML data; therefore, it does not contain additional options other than Input and Output schemas, and the Mapping tab.

Note: The Mapping tab shows how Data Integrator will map any selected output column.

When connected to an XML source, the XML_Pipeline editor shows the input and output schema structures as a root schema containing repeating and non-repeating sub-schemas, represented by these icons:
• Root schema and repeating sub-schema
• Non-repeating sub-schema

Using the XML_Pipeline transform in a data flow

Keep in mind these rules when using the XML_Pipeline:
• You cannot drag and drop the root level schema.


• You can drag and drop the same child object to the output schema multiple times, but only if you give each instance of that object a unique name. Rename the mapped instance before attempting to drag and drop the same object to the output again.

• When you drag and drop a column or sub-schema to the output schema, you cannot then map the parent schema for that column or sub-schema. Similarly, when you drag and drop a parent schema, you cannot then map an individual column or sub-schema from under that parent.

• You cannot map items from two sibling repeating sub-schemas because the XML_Pipeline does not support cartesian products (combining every row from one table with every row in another table) of two repeatable schemas.

To take advantage of the XML_Pipeline’s performance, always select a repeatable column to be mapped. For example, if you map a repeatable schema column, purchaseOrders.purchaseOrder.items.item, the XML source produces one row after parsing one item:

Avoid selecting non-repeatable columns that occur structurally after the repeatable schema, because the XML source must then assemble the entire structure of items in memory before processing. Doing so increases memory consumption when processing the output into your target.

To map both the repeatable schema and a non-repeatable column that occurs after the repeatable one, use two XML_Pipelines, and use the Query transform to combine the outputs of the two XML_Pipeline transforms and map the columns into one single target.


Activity: Defining the XML_Pipeline in a data flow

Objectives

In this activity you will use an XML data source to create a relational database table.

Scenario

We have a purchase order source with repeatable purchase order and items. This same source also contains a non-repeated Total Purchase Orders column.

The business rule provided to you requires that you combine the Customer Name, Order Date, items and totalPOs into one single relational target table. The target table should display 1 row per customer per item.

In this activity you will:
• Use the XML_Pipeline to move XML data into a target relational table.
• Create a job and define a data flow that combines the rows required from both XML sources into a single relational target table.

Instructions

1 From the Resource CD, copy these files to the C:\Temp drive:
• purchaseOrders.xsd
• pos.xml

2 In the Local object library, click the Formats tab.

3 Right-click XML Schemas, and select New.

4 Enter the following information for the XML format file:
• Format name: PurchaseOrders
• Filename /URL: Browse to C:\Temp and point to purchaseOrders.xsd
• In the Root element field, select purchaseOrders from the drop-down list.
Note: You do not need to enter a namespace. Leave the default values for the Circular level and the Default varchar size.

5 In the Designer tab of the Project area, create a job called Pipeline_PO_Job.

6 Create a data flow called Pipeline_PO_DF.

7 Double-click Pipeline_PO_DF to open it, and from the Formats tab, expand XML Schemas.

8 Drag PurchaseOrders onto the Pipeline_PO_DF workspace, and select Make XML File Source.

9 In the workspace, double-click PurchaseOrders to open it.

10 Beside the XML file field, click Browse, and browse to the C:\Temp drive.

11 Double-click pos.xml to open it.

Practice


12 Go back to the data flow level.

13 From the Transforms tab, drag two XML_Pipeline transforms onto the data flow workspace and connect them to PurchaseOrders:

14 Double-click the top XML_Pipeline to open it.

15 From the Schema In pane, drag these columns into the Schema Out pane:
• customerName
• orderDate

16 From the Schema In pane, drag the repeatable item schema to the Schema Out pane.
Note: The repeatable item schema is represented by a repeating icon.
The mapping should look like this:

Note: After it is dragged to the Output pane, the item schema should now have a non-repeating icon beside it.

17 Click the back arrow to return to the data flow level.

18 Double-click the bottom XML_Pipeline_1 transform.


19 From the Schema In pane, drag these columns into the Schema Out pane:
• customerName
• orderDate
• totalPOs
The mapping should look like this:

20 Click the back arrow to go back to the data flow level.

21 Add a Query transform to the right of the XML_Pipeline transforms, and connect the two XML_Pipeline transforms to the Query transform:

22 Double-click the Query transform to open it.
You will now use the Query transform to combine the outputs from both XML_Pipeline transforms to create one single output for loading into a relational target table.

23 From the Schema In pane, drag these columns into the Schema Out pane:
• customerName
• orderDate
• item
• totalPOs


24 From the Schema Out pane, right-click items, and select Unnest.
Your mapping should look like this:

25 In the Query Editor, click the Where tab.

26 In the Where tab smart editor area, specify this join:
XML_Pipeline.customerName = XML_Pipeline_1.customerName
Tip: Type XML in the smart editor area, and the values that start with XML appear. Scroll down to find XML_Pipeline, and click the customerName column as shown below:

Note: You must specify a join to avoid a Cartesian product that joins each item to every value of totalPOs. Without a join, this query returns a Cartesian product that produces a row for each combination of totalPOs and Partnum. By specifying a join, you ensure that the query returns only one row per customer per item.

27 Go back to the data flow level.

28 Add a target template table to the data flow, and name it ItemPOs.

29 Connect the Query transform to ItemPOs.


30 Go back to the data flow level and validate it.

31 Execute the job.
The job should execute successfully.

32 In the data flow level, double-click the magnifying glass on ItemPOs.
You can see that all the columns you mapped are now in the target table. The columns displayed should look like this:


Lesson Summary

Quiz: Using Built-in Transforms and Nested Data

1 What is the Case transform used for?

2 Name the transform that you would use to combine incoming data sets to produce a single output data set with the same schema as the input data sets.

3 A validation rule consists of a condition and an action on failure. When can you use the action on failure options in the validation rule?

4 Which transform do you use to unnest a nested data source for loading into a target table?

5 When would you use the XML_Pipeline transform?

Review


After completing this lesson, you are now able to:
• Describe and use the most commonly used built-in transforms in a data flow
• Understand hierarchical data representation
• Import metadata from XML documents
• Describe how transforms handle nested data
• Explain uses of nested data and the Query transform
• Unnest data
• Use the XML_Pipeline in a data flow

Summary


Lesson 7
Using Built-in Functions

Data Integrator gives you the ability to use simple functions, such as aggregate functions, and also includes a set of specific functions to help you perform complex operations.

In this lesson you will learn about:
• Defining built-in functions
• Using functions in expressions
• Using built-in functions

Duration: 2 hours


Defining built-in functions

Data Integrator supports built-in functions, custom functions, and imported functions.

After completing this unit, you will be able to:
• Explain what a function is
• Differentiate between functions and transforms
• List the types of operations available for functions
• List the types of functions you can use in Data Integrator

Functions take input values and produce a return value. Functions also operate on individual values passed to them.

Input values can be parameters passed into a data flow, values from a column of data, or variables defined inside a script.

You can use functions in expressions, which can appear in scripts and conditional statements.

Some functions can produce the same or similar values as transforms. However, functions and transforms operate in a different scope:
• Functions operate on single values, such as values in specific columns in a data set.
• Transforms operate on data sets, creating, updating, and deleting rows of data.

Note: Data Integrator does not support functions that include tables as input or output parameters, except functions imported from SAP R/3.

Introduction

Explaining what a function is

Differentiating between functions and transforms

[Figure: sample data sets (columns Name, Sequence, Month) illustrating the difference between a function, which operates on single values, and a transform, which operates on entire data sets.]


The function’s operation determines where you can call the function.

For example, the lookup function operates as an iterative function since it can cache information about the table and columns it is operating on between function calls.

By contrast, conversion functions, such as to_char, operate as stateless functions because conversion functions operate independently in each iteration.

An aggregate function, such as max, requires a set of values to operate.

Neither the lookup function (iterative) nor the max function (aggregate) can be called from a script or conditional object where the context does not support how these functions operate.

For a description of all built-in Data Integrator functions, see “Functions and Procedures”, Chapter 6, in the Data Integrator Reference Guide.

Listing the types of operations for functions

Below is a list of function operation types:

Operation Type    Description

Aggregate         Generates a single value from a set of values. Aggregate functions include max, min, and count.
                  Can be called only from within a Query transform—not from custom functions or scripts.

Iterative         Maintains state information from one invocation to another. The life of an iterative function's state information is the execution life of the query in which it is included.
                  Can be called only from within a Query transform—not from functions or scripts.

Stateless         State information is not maintained from one invocation to the next. Stateless functions, such as to_char or month, can be used anywhere expressions are allowed.
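To make the distinction concrete, here is a minimal sketch in the Data Integrator scripting style used later in this course (the variable name is hypothetical): a stateless function can be called from a script, while an aggregate function only makes sense as a Query transform mapping, where a whole set of rows is available.

# Stateless: valid in a script, because it operates on a single value.
$load_month = month(sysdate());

# Aggregate: valid only inside a Query transform mapping, for example
# sum(ODS_SALESITEM.PRICE), because it needs the entire input data set.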


Functions are also grouped into different categories:

In addition to built-in functions, you can also use these functions:
• Database and application functions:

These functions are specific to your DBMS. You can import the metadata for database and application functions and use them in Data Integrator applications. At run time, Data Integrator passes the appropriate information to the database or application from which the function was imported.The metadata for a function includes the input, output and their data types. If there are restrictions on data passed to the function, such as requiring uppercase values or limiting data to a specific range, you must enforce these restrictions in the input. You can either test the data before extraction or include logic in the data flow that calls the function.You can import stored procedures from DB2, MS SQL Server, Oracle, and Sybase databases. You can also import stored functions and packages from Oracle. For more information on importing functions, see “Custom Datastores”, in Chapter 5, in the Data Integrator Reference Guide.

• Custom functions: These are functions that you define. You can create your own functions by writing script functions in Data Integrator Scripting Language. You will learn more about creating custom functions in Lesson 9, “Using Variables, Parameters and Scripting”.

Listing the types of functions

Group           Functions

Aggregate       avg, count, max, min, sum

Conversion      interval_to_char, julian_to_date, num_to_interval, to_char, to_date, to_decimal

Database        key_generation, sql, total_rows

Date            add_months, concat_date_time, date_diff, date_part, day_in_month, day_in_week, day_in_year, fiscal_day, isweekend, julian, last_date, month, quarter, sysdate, systime, week_in_month, week_in_year, year

Environment     get_env, is_set_env, set_env

Math            abs, ceil, floor, rand, round, trunc

Miscellaneous   get_domain_description, ifthenelse, lookup, lookup_seq, lookup_ext, nvl, dataflow_name, datastore_field_value, gen_row_num, host_name, job_name, raise_exception, raise_exception_ext, repository_name, sleep, system_user_name, table_attribute, workflow_name, ll_error, ll_switch, truncate_table, pushdown_sql

String          index, length, lower, ltrim, print, replace_substr, rpad, rpad_ext, rtrim, rtrim_blanks, rtrim_blanks_ext, ltrim, ltrim_blanks, ltrim_blanks_ext, substr, upper, word, word_ext, WL_GetKeyValue

System          exec, mail_to

Validation      is_valid_date, is_valid_datetime, is_valid_decimal, is_valid_double, is_valid_int, is_valid_real, is_valid_time, file_exists


Using functions in expressions

Functions can be used in expressions to map return values as new output columns. Adding output columns allows columns that are not in the initial input data set to be specified in the output data set.

After completing this unit, you will be able to:
• Use functions in expressions

Functions are typically used to add:
• Columns based on some other value (lookup function)
• Generated key fields

You can use functions in:
• Transforms: for example, the Query, Case, and SQL transforms (see the mapping example after this list).
• Scripts: these are single-use objects used to call functions and assign values to variables in a work flow.
• Conditionals: these are single-use objects used to implement if/then/else logic in a work flow. Conditionals and their components (if expressions, then and else diagrams) are included in the scope of the parent control flow's variables and parameters.
• Other custom functions
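For instance, in a Query transform you can map a new output column to an expression that combines several functions. A minimal sketch (the table and column names here are hypothetical, not from the course activities):

FULL_NAME:  upper(rtrim_blanks(CUSTOMER.LAST_NAME)) || ', ' || CUSTOMER.FIRST_NAME

The expression uses string functions from the list shown earlier and the || operator to concatenate the two source columns into one output value.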

Before you use a function, you need to know if the function’s operation makes sense in the expression you are creating.

For example, the max function cannot be used in a script or conditional where there is no collection of values on which to operate. As another example, parameters can be output by a work flow but not by a data flow.

You can add existing functions in an expression by using the Smart Editor or the Function Wizard. The Smart Editor offers you many options including available variables, data types, keyboard shortcuts and so on. The Function Wizard allows you to define parameters for an existing function and is recommended for defining complex functions.

Introduction

Using functions in expressions


To include a function in an expression using the Smart Editor

1 Double-click the query, script, or conditional that you want to include an expression for.

2 In the Mapping tab, click the Smart Editor icon.
The Smart Editor window opens.

3 Click the Functions tab, and expand a function category.

4 Select and drag the specific function onto the workspace on the right.

5 Enter the input parameters based on the syntax of your formula.

6 Click OK.
The function you entered displays in the embedded Smart Editor.


To include a function in an expression using the Function Wizard

1 Double-click the query, script, or conditional that you want to include an expression for.

2 In the Mapping tab, click Functions.

3 The Select Function window opens.

4 In the Function categories list, select a category.

5 In the Function name list, select a specific function.
The functions shown depend on the object you are using. Clicking on each function separately also displays a description of the function below the list boxes.

6 Click Next.

7 Enter the parameter values required by the function.
The input parameter definitions window is unique for each function.

8 Click to select input parameters.

9 Click Finish.
The function and the parameters display in the embedded Smart Editor.


Using built-in functions

Data Integrator provides over 60 built-in functions. For this lesson, we focus only on the functions that you need in order to complete the activities.

After completing this unit, you will be able to:
• Use date and time functions and the Date_Generation transform to build a dimension table
• Use lookup functions to look up status in a table
• Use match pattern functions to compare input strings to patterns in Data Integrator
• Use database type functions to return information on data sources

The built-in functions for date and time and built-in date_generation transform are useful when building a time dimension table.

to_char

You use the to_char function to convert a date to a string.

Syntax
to_char(date1, format)

Return value
varchar — A formatted string describing date1.

Where
date1 — indicates the source date, time, or datetime value.
format — a string indicating the format of the generated string.

You can use these codes:

Introduction

Using date and time functions and the date_generation transform to build a dimension table

DD      2-digit day of the month
MM      2-digit month
MONTH   Full name of month
MON     3-character name of month
YY      2-digit year
YYYY    4-digit year
HH24    2-digit hour of the day (0-23)
MI      2-digit minute (0-59)
SS      2-digit second (0-59)
FF      Up to 9-digit sub-seconds


Example
to_char(call_date, 'dd-mon-yy hh24:mi:ss.ff')
returns a date value from the call_date column formatted as a string, such as:
28-FEB-97 13:45:23.32
Note: The hyphens and spaces in the format are reproduced in the actual result; all the other characters are recognized as format codes from the table above and substituted with the appropriate current values.

to_date

You use the to_date function to convert a string to a date.

Syntax
to_date(input_string, format)

Return Value

date, time, or datetime — A date, time, or both representing the original string.

Where

input_string represents the source string

format represents a string indicating the format of the source string

You can use these codes:

DD      2-digit day of the month
MM      2-digit month
MONTH   Full name of month
MON     3-character name of month
YY      2-digit year
YYYY    4-digit year
HH24    2-digit hour of the day (0-23)
MI      2-digit minute (0-59)
SS      2-digit second (0-59)
FF      Up to 9-digit sub-seconds

Example
to_date('Jan 8, 1968', 'mon dd, yyyy')
results in 1968.01.08 stored as a date


julian

You use the julian function to convert a date to its integer julian value, the number of days between the start of the julian calendar and the date.

Syntax
julian(date1)

Return Value
int — The julian representation of the date.

Where
date1 indicates the source value of type date or datetime.

Example
julian(to_date('Apr 19, 1997', 'mon dd, yyyy'))
returns a value of 729436

month

You use the month function to determine the month in which the given date falls.

Syntax
month(date1)

Return Value
int — The number from 1 to 12 that represents the month component of date1.

Where
date1 represents the source date.

Example
month(to_date('Jan 22, 1997', 'mon dd, yyyy'))
results in 1
month(to_date('3/97', 'mm/yy'))
results in 3

quarter

You use the quarter function to determine the quarter in which the given date falls.

Syntax
quarter(date1)

Return value
int — The number from 1 to 4 that represents the quarter component of date1.

Where
date1 represents the source date.

Example
quarter(to_date('Jan 22, 1997', 'mon dd, yyyy'))
results in 1


quarter(to_date('5/97', 'mm/yy'))

results in 2

Date_Generation transform

The Date_Generation transform is ideal for creating time dimension tables. It produces a series of dates incremented as specified by you.

You use this transform to produce the key values for a time dimension target. From this generated sequence you can populate other fields in the time dimension (such as day_of_week) using functions in a query. There are no data inputs for this transform.

Note: You will have a chance to use this transform in an activity in the upcoming lesson of this course.

Options

The Date_Generation transform offers several options:
• Start date: the first date in the output sequence. You specify this date using the YYYY.MM.DD format.
• End date: the last date in the output sequence. Use the same format used for Start date to specify this date.
• Increment: the interval between dates in the output sequence. Select Daily, Monthly, or Weekly.
• Join rank: a positive integer indicating the weight of the output data set if the data set is used in a join. Sources in the join are accessed in order based on their join ranks. The highest ranked source is accessed first to construct the join.
• Cache: used to hold the output from the transform in memory so it may be used in subsequent transforms.

Data Output

A data set with a single column named DI_GENERATED_DATE containing the date sequence. The rows generated are flagged as INSERT.

The Date_Generation transform does not generate hierarchical data. Generated dates can range from 1900.01.01 through 9999.12.31.
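For example (a minimal sketch; DAY_OF_WEEK is a hypothetical target column, and the mapping style mirrors the Query mappings used in the activities), a Query transform placed after the Date_Generation transform can derive additional time-dimension columns from the generated column:

DAY_OF_WEEK    day_in_week(DI_GENERATED_DATE)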


Activity: Populating the time dimension data flow

Objectives

In this activity you will:
• Use the Date_Generation transform to specify a series of data increments as indicated by the identified business requirements
• Load this data into the TIME_DIM dimension table in your target database

Instructions

1 Create a new job and name it TimeDim_Job.
This job loads data into the time dimension table.

2 Open TimeDim_Job and create a data flow named TimeDim_DF.
You will not create a work flow for this job.

3 Open the data flow named TimeDim_DF.

4 Select the Date_Generation transform and drag it onto the workspace.

5 Add the Query transform and connect the Query transform to the Date_Generation transform.

6 From the Target_DS datastore, drag the table TIME_DIM onto the workspace and make it your target table.

7 Connect the Query transform to the TIME_DIM table.

8 Enter these values in the Date_Generation transform editor fields:
• Start Date: 1994.01.01
• End Date: 2000.12.31
• Increment: Daily

9 Double-click the Query transform and drag the generated_date column from the input schema to the NativeDate output column and select Remap Column.

10 Double-click each column name in the Schema Out column and use the expressions in the table to type the mapping expression in the column properties window to define each column mapping:

Practice

Column name    Mapping Function                      Description

Date_ID        julian(di_generated_date)             Use the julian function to set the julian date for that date value

MonthNum       month(di_generated_date)              Use the month function to set the month number for that date value

BusQuarter     quarter(di_generated_date)            Use the quarter function to set the quarter for that date value

YearNum        to_char(di_generated_date, 'YYYY')    Use the to_char function to select only the year out of the date value (enclose YYYY in single quotes)

Note: Assume that the business year is the same as the calendar year.

11 Click the back arrow in the toolbar to return to the data flow level. These columns are now the input schema for the TIME_DIM table.

12 Save and execute the job.

13 Use View data to check the contents in the TIME_DIM table.


Use lookup functions to look up status in a table

The lookup, lookup_seq, and lookup_ext functions all provide:
• A specialized type of join, similar to an SQL outer join. An SQL outer join may return multiple matches for a single record in the outer table; lookup functions always return exactly the same number of records that are in the source (outer) table.
• Sophisticated caching options
• A default value when no match is found

Note: The lookup_ext function is based on the lookup function and provides enhanced usage features. It is recommended that you use lookup_ext if you are using Data Integrator XI or later. For this reason, we are only going to cover the lookup_ext() and lookup_seq() functions in this section. For more information on the lookup() function, see “Functions and Procedures”, Chapter 7, in the Data Integrator Reference Guide.

Lookup functions return one row for each row in the source:

[Figure: The source table (Employee: EmployID, FirstName, LastName) contains the source columns; the target table (SalaryReport: EmployName, Salary) contains the related target columns. The lookup function uses a third table (Finance: EmployID, Salary) that translates the values from the source table to the values you want in the target table.]

While all the lookup functions return one row for each row in the source, they differ by how they choose which of several matching rows to return:
• lookup_ext(): Allows specification of an Order by column and a Return policy (Min, Max) to return the record with the highest/lowest value in a given field, for example, a surrogate key.
• lookup_seq(): Searches in matching records to return a field from the record where the sequence column, for example, effective_date, is closest to but not greater than a specified sequence value (for example, a transaction date).

We will look at these functions in more detail in the next sections.


lookup_ext

You can use this function to retrieve a value in a table or file based on the values in a different source table or file. This function also extends functionality by allowing you to:
• Return multiple columns from a single lookup.
• Choose from more operators to specify a lookup condition.
• Specify a return policy for your lookup.
• Perform multiple (including recursive) lookups.
• Call lookup_ext in scripts and custom functions. This also lets you reuse the lookup(s) packaged inside scripts.
• Define custom SQL, using the SQL_override parameter, to populate the lookup cache, narrowing large quantities of data to only the sections relevant for your lookup(s).
• Use lookup_ext to dynamically execute SQL.
• Call lookup_ext, using the Function Wizard, in the query output mapping to return multiple columns in a Query transform.
• Design jobs to use lookup_ext without having to hard-code the name of the translation file at design time.
• Use lookup_ext with memory datastore tables.

Tip: Use this function to the right of the Query transform instead of to the right of a column mapping. This allows you to select multiple output columns and go back to edit the function in the Function Wizard instead of manually editing the function's complex syntax.

Syntax
lookup_ext ([translate_table, cache_spec, return_policy], [return_column_list], [default_value_list], [condition_list], [orderby_column_list], [output_variable_list], [sql_override])

Return Value

any type — The return type is the first lookup column in return_column_list.
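For orientation before the parameter descriptions, here is the shape of a filled-in call. This sketch mirrors the call you will construct in the activity later in this lesson; treat the datastore, table, and column names as placeholders:

lookup_ext([ODS_DS.DBO.ODS_DELIVERY, 'PRE_LOAD_CACHE', 'MAX'],
    [DEL_ORDER_STATUS], ['N/A'],
    [DEL_SALES_ORDER_NUMBER, '=', ODS_SALESITEM.SALES_ORDER_NUMBER])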

Where

translate_table Represents the table, file, or memory datastore that contains the result(s) or value(s) you are looking up (result_column_list).

If the translate_table is a database table, use the datastore.owner.table format. For example:

ERP_ds.OWNER.EMPLOYEES

If the translate_table is a flat file, use the file_ds.filename format. For example:

delim..c:/temp/employees.

cache_spec Represents the caching method the lookup_ext operation uses. You can choose from three settings: NO_CACHE, PRE_LOAD_CACHE, or DEMAND_LOAD_CACHE.

Page 226: Data Integrator DM370R2 Learner Guide GA

7-16 Data Integrator XI R1/R2: Extracting, Transforming, and Loading Data—Learner’s Guide

return_policy This optional parameter specifies whether the return columns should be obtained from the smallest or the largest row based on values in the orderby columns. The value for this parameter can be MIN or MAX, and when left blank, it defaults to MAX.

Use return_policy when you expect duplicate rows and want output data from one of the selected rows.

return_column_list Is a comma-separated list containing the names of output columns in the translate_table.

default_value_list Is a comma-separated list containing the default expressions for the output columns. When no rows match the lookup condition, the default values are returned for the output columns.

Each default expression type must be compatible with the corresponding output column type, such that if the types are not exactly the same, automatic conversion is still possible.

If default_value_list is empty or has fewer expressions than the number of output columns, NULL will be used as the default. You cannot have more default expressions than the number of output columns.

condition_list Is a list of triplets that specify lookup conditions. Each triplet contains a compare_column, a compare_operator (<, <=, >, >=, =, !=, IS, IS NOT), and a compare_expression.

The compare_column is from the translate_table. It is compared against compare_expression to compute the output row.

The compare_expression is written in terms of constants, variables, and columns in the calling data flow or scripts. While it cannot contain a column reference from the translate_table, it can be a simple constant, variable, or column reference, or a complex expression involving arithmetic operations and function calls.

Use the compare operators IS and IS NOT to examine compare_column against the NULL constant. When you use IS or IS NOT as the compare operator, compare_expression must contain the NULL constant. When you use other operators with a compare_expression containing a NULL, the lookup condition will always return FALSE.
If you create more than one triplet, all triplets are implicitly combined together with AND to compute the final lookup condition.

EXAMPLE: [c1, '=', 10, c2, '<', query.a, c3, '>=', lower(query.name)]


orderby_column_list Is a comma-separated list of column names from the translate_table.

Working together with return_policy, the orderby_column_list is used to determine which duplicate row to return as output when more than one row satisfies the lookup condition.

When duplicate rows occur, the list of duplicates is sorted based on the columns from the orderby_column_list, and a row is returned using the MIN/MAX return_policy.

The orderby_column_list is optional. If left empty, the orderby columns match the output columns.

EXAMPLES:
[c1,c3,c4]: Sort the duplicate rows using column values in c1, c3, c4.
[ ]: An empty list is a placeholder for specifying subsequent parameters.


output_variable_list Is a comma-separated list of output variables.

When more than one output column is specified in the function call, the output variables are used to receive output returns. Variables and output columns are matched by position. This parameter is optional unless more than one output column appears in the return_column_list. When this occurs, output variables must be equal in number to output columns.

To enable conversion, the variable data type must be compatible with the corresponding output column. You do not need to specify output variables if the function is called using the Function Wizard to map output columns in the query window.
EXAMPLE: [$a,$b,$c]

sql_override This parameter, available as a button called Custom SQL in the Function Wizard, must contain a valid, single-quoted SQL SELECT statement or a $variable of type varchar to populate the lookup cache when the cache specification is PRE_LOAD_CACHE. This parameter replaces the SQL SELECT statement generated internally by the function for populating the cache. The SELECT statement must select at least those columns referenced in return_column_list, condition_list, and orderby_column_list.

Any valid SQL SELECT statement is permitted and may contain references to other tables besides the translate_table to specify inner and outer joins. This parameter can only be specified when the translate_table is a database table.

If this parameter is specified when the cache specification is NO_CACHE, the sql_override query is executed each time the function is called.

If this parameter is specified when the cache specification is PRE_LOAD_CACHE, only the first sql_override query is executed to populate the lookup cache. All subsequent SQL statements are ignored after the lookup cache is built.

If this parameter is specified when the cache specification is DEMAND_LOAD_CACHE, the caching mode will be converted to PRE_LOAD_CACHE and behaves as if the PRE_LOAD_CACHE mode were specified.

EXAMPLE:
['select out1, out2, compare1, compare2, orderby1, orderby2 from lookuptbl, othertbl where c1=10 and lookuptbl.c2=othertbl.c2']


Example: LOOKUP REPLACEMENT

In this example we will retrieve the name of an employee whose empno is equal to 1.

These statements show how you can do this with both lookup and lookup_ext:

lookup(ds.owner.emp, empname, 'no body', 'NO_CACHE', empno, 1);

lookup_ext([ds.owner.emp, 'NO_CACHE', 'MAX'], [empname], ['no body'], [empno, '=', 1]);

lookup_seq

Retrieves a value in a table or file based on the values in a different source table or file and a particular sequence value.

Syntax
lookup_seq (translate_table, result_column, default_value, sequence_column, sequence_expression, compare_column, expression)

Return value

any type - The value in the translate_table that meets the lookup_seq requirements. The return type is the same as result_column.

Where

translate_table The table or file that contains the result or value you are looking up (result_column).

Use a fully qualified table name, including the datastore, owner, and table name. For example: ERP_ds.OWNER.EMPLOYEES.

The translate_table is cached automatically for the operation of the function.

result_column The column containing the values you want to retrieve.
Note: This column is in the translate_table.

default_value The value returned when there is no matching row in the translate_table.

sequence_column The column in the translate_table that indicates the sequence of the row.

This column often contains a date that indicates when new values were added to the row. For example, in some source tables, sequence_column is the EFFDT column, which indicates when the data in the row became effective.


sequence_expression The value the function searches for in the sequence_column to find a matching row. For example, if you are looking up values from a slowly changing dimension table and are interested in only those rows in which the data is current as of today, you could use the return value from the sysdate function for sequence_expression.

compare_column The column in the translate_table that the function uses to find a matching row.

expression The value that the function searches for in the compare_column.

This can be a simple column reference, such as a column found in both a source and the translate_table. This can also be a complex expression given in terms of constants and input column references.

When expression refers to a unique source column, you do not need to include a table name qualifier. If expression is from another table or is not unique among the source columns, you need a table name qualifier.

Example: LOOKUP_SEQ REPLACEMENT

In this example we will retrieve the name of the oldest employee who works for department 20 and is less than or equal to 30 years old.

The following statements show how you can do this with both lookup_seq and lookup_ext:

lookup_seq(ds.owner.emp, empname, 'no body', age, 30, deptno, 20);

lookup_ext([ds.owner.emp, 'PRE_LOAD_CACHE', 'MAX'], [empname], ['no body'], [deptno, '=', 20, age, '<=', 30], [age]);

Note: The return policy, MAX, is optional since MAX is already the default behavior.


Use match pattern functions to compare input string patterns

The match_pattern function allows you to compare input strings to simple patterns supported by Data Integrator. This function looks for a matching whole string, not substrings.

Syntax
match_pattern(input_string, pattern_string)

Return Value
Integer — returns 1 for a match, otherwise 0.

Where
input_string represents the data to be searched.
pattern_string represents the pattern you want to find in the input string.

Use these characters to create a pattern:

X      Represents uppercase characters
x      Represents lowercase characters
9      Represents numbers
\      Escape character
*      Any characters occurring zero or more times
?      Any single character occurring once and only once
[ ]    Any one character inside the braces occurring once
[!]    Any character except those after the exclamation point (for example, [!BOBJ] allows anything except BOBJ)

All other characters represent themselves. If you want to specify a special character as itself, it has to be escaped.

This table displays pattern strings that represent example values:

Example Value            Pattern String
Henrick                  Xxxxxxx
DAVID                    XXXXX
Tom Le                   Xxx Xx
Real-time                Xxxx-xxxx
JJD)$@&*hhN8922hJ7#      XXX)$@&*xxX9999xX9#
1,553                    9999
0.32                     9.99
-43.88                   -99.99

These pattern strings are shown with the values they match:

*Jones                   Returns names with last name Jones
Henrick?                 Returns Henrick1 or HenrickZ
David[123]               Returns David1 or David2 or David3


Example

Use the match_pattern function in the:
• Validation transform editor, under the condition section of the Validation Rule tab
• WHERE clause of a Query transform

The input string can be from sources such as columns, variables, or constant strings.

For example, in a Query WHERE clause:
WHERE MATCH_PATTERN(CUSTOMER.PHONE_NUM, '999-999-9999') <> 0



Activity: Using the Lookup_ext() Function

Objectives

In this activity you will:
• Define the data flow that will generate the SalesFact table
• Use the lookup_ext function to look up order status as identified by your business requirements

Instructions

Define the data flow that will generate the sales fact table

1 Create a new batch job called SalesFact_Job and add a work flow named SalesFact_WF.

2 Open SalesFact_WF.

3 Add a data flow and call it SalesFact_DF.

4 Open SalesFact_DF.

5 From the ODS_DS tables, select and add the ODS_SALESITEM table to the workspace and make it a source table.

6 Add the ODS_SALESORDER table to the workspace and also make it a source table.

7 Add a Query transform to the workspace.

8 From the Target_DS datastore, select and add the SALES_FACT table to the workspace and make it a target table.

9 Connect the source tables, Query transform, and target table to indicate the flow of the data.

Define the details of the query, including the join between source tables

1 Open the Query transform editor.

2 In the Where tab, select Propose Join.
The resulting relationship is:
ODS_SALESORDER.SALES_ORDER_NUMBER = ODS_SALESITEM.SALES_ORDER_NUMBER

3 Add the following text to the Where tab to filter the sales orders by date:
AND ODS_SALESORDER.ORDER_DATE >= to_date('2000.01.01','YYYY.MM.DD')
AND ODS_SALESORDER.ORDER_DATE <= to_date('2000.12.31','YYYY.MM.DD')

This brings one year's sales orders into the target. You can drag and drop the column names from the input schema.

Practice


4 Map the following source columns to output fields:

Table        Source Column         Data Type                   Description          Target Column (in order)
salesorder   Sales_order_number    varchar(10)                 Sales order number   SLS_doc_no
             Cust_ID               varchar(10)                 Customer ID          Cust_ID
             Order_Date            date or datetime (Sybase)   Order date           SLS_doc_date
salesitem    Sales_line_item_ID    varchar(6)                  Line item number     SLS_doc_line_no
             Mtrl_ID               varchar(18)                 Material ID          Material_no
             Price                 varchar(10)                 Order item price     Net_value

Note: Do not remap the order_status column yet. You will map the order_status column with the lookup_ext function by following the instructions in the next section.

Use a lookup_ext function for order status

Look up the Sales_Fact table order status value (column ORDER_STATUS) from the Delivery table (column DEL_ORDER_STATUS).

1 In the Query transform editor, click the Mapping tab, and under the Schema Out column, select ORD_STATUS.

2 Click Functions to access the Function Wizard.

3 In the Function Wizard, in the Function categories list, click Lookup Functions.

4 Select lookup_ext from the Function name list, and click Next.

5 Click the arrow beside Translate Table, and select ODS_Delivery from your ODS datastore to define the values for the lookup_ext() function:
• Under Condition, click Table Column and select DEL_SALES_ORDER_NUMBER.
• Click Op. and select the = operator.
• Click the ellipses beside Expression to access the Expression Function Wizard.
• In the Expression Function Wizard, click the Data tab, expand ODS_SALESITEM, and drag SALES_ORDER_NUMBER into the text area on the right.
• Add the table column DEL_ORDER_ITEM_NUMBER.
• Select the = operator.
• In the Expression Function Wizard, click the Data tab, expand ODS_SALESITEM, and drag SALES_LINE_ITEM_ID into the text area on the right.
• Under Output parameters, click Table Column, and select DEL_ORDER_STATUS.



• Click the ellipses beside Default value and type 'N/A'.
The Lookup_ext - Select Parameters window should look like this:

6 Click Finish.
The lookup_ext function syntax appears in the Mapping tab and should look like this:

lookup_ext([ODS_DS.DBO.ODS_DELIVERY, 'PRE_LOAD_CACHE', 'MAX'],
[ DEL_ORDER_STATUS ], [ 'N/A' ],
[ ODS.DEL_SALES_ORDER_NUMBER, '=', ODS_SALESITEM.SALES_ORDER_NUMBER,
DEL_ORDER_ITEM_NUMBER, '=', ODS_SALESITEM.SALES_LINE_ITEM_ID ])

7 Return to the SalesFact_DF level in the workspace.

8 Validate all objects in view to verify that the description has been constructed properly.
If your design contains syntax errors, a dialog box appears with a message describing the error.
Note: Warning messages are OK.

9 Add SalesFact_Job to the Exercises project in the Project area.

10 Execute SalesFact_Job.

No error messages should appear in the status window. However, you may see a warning message indicating that a conversion from a date to datetime value occurred.


Use these functions to return information on your data source when using multiple sources.

db_type

The db_type function returns the database type of the datastore configuration in use at runtime.

This function is useful if your datastore has multiple configurations. For example, you can use this function in a SQL statement instead of using a constant. This allows the SQL statement to use the correct database type for each job run no matter which datastore configuration is in use.

Syntax
db_type(ds_name)

Return Value

varchar

The possible db_type() return values depend on the database type of the datastore configuration (for example, 'Microsoft_SQL_Server' or 'DB2', as used in the examples in this section).

Where

ds_name represents the datastore name that you entered when you created the datastore.

Example

If you have a SQL transform that performs a function that has to be written differently for different database types, you can tell Data Integrator which text to use for each type. In this example, the sql() function is used within a script:

IF (db_type('sales_ds') = 'Microsoft_SQL_Server')

BEGIN

IF (db_version('sales_ds') = '2000')

$sql_text = '…';

ELSE

$sql_text = '…';

END

Sql('sales_ds', '{$sql_text}');
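As a more concrete sketch (the SQL statements are hypothetical and only illustrate the branching; the ELSE branch assumes the other configuration is DB2, and 'sales_ds' is reused from the example above):

IF (db_type('sales_ds') = 'Microsoft_SQL_Server')
$sql_text = 'select getdate()';
ELSE
$sql_text = 'select current timestamp from sysibm.sysdummy1';
Sql('sales_ds', '{$sql_text}');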

Using database functions to return information on data sources

Page 237: Data Integrator DM370R2 Learner Guide GA

Using Built-in Functions—Learner’s Guide 7-27

db_version

The db_version function returns the database version of the datastore configuration in use at runtime.

This function is useful if your datastore has multiple configurations. For example, you can use this function in a SQL statement instead of using a constant. This allows the SQL statement to use the correct database version for each job run no matter which datastore configuration is in use.

Syntax
db_version(ds_name)

Return Value

varchar

Where

ds_name represents the datastore name that you entered when you created the datastore.

Example

If you have a SQL transform that performs a function that has to be written differently for different versions of SQL Server, you can tell Data Integrator which text to use for each database version. In this example, the sql() function is used within a script:

IF (db_type('sales_ds') = 'Microsoft_SQL_Server')

BEGIN

IF (db_version('sales_ds') = '2000')

$sql_text = '…';

ELSE

$sql_text = '…';

END

Sql('sales_ds', '{$sql_text}');


db_database_name

The db_database_name function returns the database name of the datastore configuration in use at runtime.

This function is useful if your datastore has multiple configurations and is accessing a Microsoft SQL Server or Sybase database. For a datastore configuration that uses either of these database types, you enter a database name when you create the datastore. This function returns that database name.

For example, master is a database name that exists in every Microsoft SQL Server and Sybase database. However, if you use different database names, you can use this function in, for example, a SQL statement instead of using a constant. This allows the SQL statement to use the correct database name for each job run no matter which datastore configuration is in use.

This function returns an empty string for datastore configurations without Microsoft SQL Server or Sybase as the Database Type.

Syntax
db_database_name(ds_name)

Return Value

varchar

Where

ds_name represents the datastore name that you entered when you created the datastore.

Example

If you have a SQL transform that performs a function that has to be written differently for different database types, you can tell Data Integrator which text to use for each type. In this example, the sql() function is used within a script:

IF (db_type('sales_ds') = 'DB2')

$sql_text = '…';

ELSE

BEGIN

IF (db_type('sales_ds') = 'Microsoft_SQL_Server')

$db_name = db_database_name('sales_ds');

$sql_text = '…';

END

Sql('sales_ds', '{$sql_text}');


db_owner

The db_owner function returns the real owner name for the datastore configuration that is in use at runtime.

This function is useful if your datastore has multiple configurations because with multiple configurations, you can use alias owner names instead of database owner names. By using aliases instead of real owner names, you limit the amount of time it takes to port jobs to different environments.

For example, you can use this function in a SQL statement instead of using a constant. This allows the SQL statement to use the correct database owner for each job run no matter which datastore configuration is in use.

Syntax
db_owner(ds_name, alias_name)

Return Value

varchar

Where

ds_name represents the datastore name that you entered when you created the datastore.

alias_name represents the name of the alias that you created in the datastore, and then mapped to the real owner name when you created a datastore configuration.

Example
$real_owner = db_owner('sales_ds', 'sales_person');
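A short sketch of how the returned owner name might then be used to build a SQL statement (the table name SALES_FACT is borrowed from the earlier activity; the statement itself is only illustrative):

$real_owner = db_owner('sales_ds', 'sales_person');
$sql_text = 'select count(*) from ' || $real_owner || '.SALES_FACT';
Sql('sales_ds', '{$sql_text}');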

decode

You use the decode function to return an expression based on the first condition in the specified list of conditions and expressions that evaluates to TRUE. It provides an alternate way to write nested ifthenelse functions.

Use this function to apply multiple conditions when you map columns or select columns in a query. For example, you can use this function to put customers into different groupings.

Syntax
decode(condition_and_expression_list, default_expression)

Return Value

expression or default_expression — returns the value associated with the first condition that evaluates to TRUE.

The data type of the return value is the data type of the first expression in the condition_and_expression_list.

Note: If the data type of any subsequent expression or the default_expression is not convertible to the data type of the first expression, Data Integrator produces an error at validation. If the data types are convertible but do not match, a warning appears at validation.


Where

condition_and_expression_list represents a comma-separated list of one or more pairs that specify a variable number of conditions. Each pair contains one condition and one expression separated by a comma. You must specify at least one condition and expression pair:
• The condition evaluates to TRUE or FALSE.
• The expression is the value that the function returns if the condition evaluates to TRUE.

default_expression represents an expression that the function returns if none of the conditions in condition_and_expression_list evaluate to TRUE.
Note: You must specify a default_expression.

Example

The decode function provides an easier way to write nested ifthenelse functions. In nested ifthenelse functions, you must write nested conditions and ensure that the parentheses are in the correct places, as in this example:

ifthenelse ((EMPNO = 1), '111',

ifthenelse((EMPNO = 2), '222',

ifthenelse((EMPNO = 3), '333',

ifthenelse((EMPNO = 4), '444',

'NO_ID'))))

In the decode function, you list the conditions as in this example: decode ((EMPNO = 1), '111',

(EMPNO = 2), '222',

(EMPNO = 3), '333',

(EMPNO = 4), '444',

'NO_ID')

Therefore, decode is less prone to error than nested ifthenelse functions.

To improve performance, Data Integrator pushes this function to the database server when possible. Thus, the database server, rather than Data Integrator, evaluates the decode function.


Activity: Using the db_name and db_type functions

Objectives

In this activity you will use the db_name and db_type functions to determine the source database type and name for a value in the CUSTOMER column.

Instructions

1 Build a new job called New_Functions_Job.

2 Create a data flow called New_Functions_DF.

3 Add the Northwind Customers table as your source table.

4 Add the Query transform to New_Functions_DF and connect the Customers source table to the Query transform.

5 Open the Query Editor.

6 Right-click the Query table in the Schema Out area to create a new output column called Name with the varchar data type and a length of 50:

7 Remap the COMPANYNAME column from the Schema In pane to the NAME column in the Schema Out pane.

8 Create two more output columns called DB_NAME and DB_TYPE (also as varchar(50)).

9 Select DB_NAME, and in the Mapping tab, create this function:
db_database_name('Northwind')

Practice


10 For the DB_TYPE output column, create this function:
db_type('Northwind')

Your Schema Out area should look like this:

11 Create a target template table called NEW_FUNCTION and connect this to the Query transform.

12 Associate New_Functions_Job to a project, validate, and execute your job.

13 Go back to the New_Functions_DF level and click the magnifying glass on the NEW_FUNCTION target table.
You should see the name of the customer, plus the database they came from (DB_NAME) and the database type (DB_TYPE):


Activity: Using the decode function

Objectives

In this activity you will use the decode function to move customers into different groupings.

Instructions

The customers in the Customers table of the Northwind database are categorized into two continental regions: Americas and Non-Americas.

Customers for the Americas in this table include those from the USA, Brazil, Mexico, Venezuela, or Argentina. The remaining customers fall under the Non-Americas category.

We are going to build a function that will put all the customers from the USA, Brazil, Mexico, Venezuela, or Argentina into a grouping called Americas, and the customers from the remaining countries into Non-Americas.

1 Using the New_Functions_DF, modify the Query transform by adding a new output column:
• DECODE, data type: varchar(50)

2 From the CUSTOMERS source table, drag the column COUNTRY to the target table and place it under the DECODE column.

3 Select the DECODE column in Schema Out.

4 In the Mapping tab, click Functions, and select Miscellaneous Functions.

5 From the Function name box, select decode.

6 Click Next, and in the Define Input Parameter(s) window, enter:
• Conditional expression: COUNTRY IN ('USA', 'Brazil', 'Mexico', 'Venezuela', 'Argentina')
• Case expression: 'Americas'
• Default expression: 'Non-Americas'

7 Click Finish.


The function should appear in the text box of the Mapping tab. You may want to re-organize the function so that it is easier to see:
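For reference, the resulting mapping expression should look roughly like this (a sketch based on the parameters you entered; the generated formatting may differ):
decode((COUNTRY IN ('USA', 'Brazil', 'Mexico', 'Venezuela', 'Argentina')), 'Americas', 'Non-Americas')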

The Schema Out area should look like this:

8 Add New_Functions_Job to the Exercises project in the project area.

9 In the project area, validate and execute New_Functions_Job.
10 Go back to the New_Functions_DF level, and click the magnifying glass on the NEW_FUNCTION target table. In addition to the customer Name, DB_NAME, and DB_TYPE columns, you should also see COUNTRY and the corresponding region each customer has been allocated to by the decode function (in the DECODE column).


Lesson Summary

Quiz: Using Built-in Functions
1 Describe the differences between a function and a transform.

2 Why are functions used in expressions?

3 What does a lookup function do?

4 What value would the Lookup_ext function return if multiple matching records were found on the translate table?
• Depends on Return Policy (Min or Max)
• An arbitrary matching row
• Closest value less than or equal to value from sequence column
• #MULTIVALUE error for records with multiple matches

After completing this lesson, you are now able to:
• Explain what a function is
• Differentiate between functions and transforms
• List the types of operations available for functions
• List the types of functions you can use in Data Integrator
• Use functions in expressions
• Use date and time functions and the date_generation transform to build a dimension table
• Use lookup functions to look up status in a table
• Use match pattern functions to compare input strings to patterns in Data Integrator
• Use database type functions to return information on data sources


Lesson 8
Using Data Integrator Scripting Language and Variables

You can increase the flexibility and reusability of work flows and data flows using global variables when you design your job.

Data Integrator scripting language also extends job flexibility by allowing you to write scripts, custom functions, and expressions.

In this lesson you will learn about:
• Understanding variables
• Understanding global variables
• Understanding Data Integrator scripting language
• Scripting a custom function

Duration: 1 hour


Understanding variables

With the Data Integrator scripting language, you can assign values to variables, call functions, and use standard string and mathematical operators. The syntax used in the Data Integrator scripting language can be used in an expression as well as in a script.

After completing this unit, you will be able to:
• Understand variables
• Describe the Variables and Parameters window
• Explain differences between global and local variables

Variables are symbolic placeholders for values. The data type of a variable can be any supported by Data Integrator such as an integer, decimal, date, or text string.

You can use variables in expressions to facilitate decision-making or data manipulation (using arithmetic or character substitution). The variable name must start with a dollar sign ($). For example, a variable can be used in a LOOP or IF statement to check its value and decide which step to perform:
If $amount_owed > 0 print($invoice.doc);

The above code prints an invoice if the amount owed is greater than $0.
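To make this concrete, a small script could assign the variable and then branch on it (a minimal sketch with hypothetical values; $amount_owed is assumed to be a decimal variable defined for the job):
$amount_owed = 125.50;
if ($amount_owed > 0) print('Amount owed is [$amount_owed]: printing invoice');
else print('Nothing owed');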

If you define variables in a job or work flow, Data Integrator typically uses them in a script, catch, or conditional process.

A script is a single-use object used to call functions and assign values to variables in a work flow.


A catch is part of a serial sequence called a try/catch block. The try/catch block allows you to specify alternative work flows if errors occur while Data Integrator is executing a job. Try/catch blocks catch groups of errors, apply solutions that you provide, and continue execution.

A conditional process uses a conditional, which is a single-use object, available in work flows, that allows you to branch the execution logic based on the results of an expression. The conditional takes the form of an if/then/else statement.

You can increase the flexibility and reusability of work flows and data flows using local and global variables when you design your jobs.

Describing the variables and parameters window

Data Integrator displays the variables and parameters defined for an object in the Variables and Parameters window.

To view the variables and parameters in each job, work flow, or data flow:
1 In the Tools menu, select Variables. The Variables and Parameters window opens.
2 From the Local object library, double-click an object. The Context box in the window changes to show the object you are viewing. If there is no object selected, the window does not indicate a context.

The Variables and Parameters window contains two tabs. The Definitions tab allows you to create and view variables (name and data type) and parameters (name, data type, and parameter type) for an object type.


This table lists the type of variables and parameters you can create using the Variables and Parameters window when you select different objects.

Object Type | What you can create for the object | Used by
Job | Local variables | A script or conditional in the job
Job | Global variables | Any object in the job
Work flow | Local variables | This work flow or passed down to other work flows or data flows using a parameter
Work flow | Parameters | Parent objects to pass local variables. Work flows may also return variables or parameters to parent objects.
Data flow | Parameters | A WHERE clause, column mapping, or a function in the data flow. Data flows cannot return output values.

The Calls tab allows you to view the name of each parameter defined for all objects in a parent object’s definition.

For the input parameter type, values in the Calls tab can be constants, variables, or another parameter.

For the output or input/output parameter type, values in the Calls tab can be variables or parameters.

Values in the Calls tab must also use:
• The same data type as the variable if they are placed inside an input or input/output parameter type, and a compatible data type if they are placed inside an output parameter type

• Data Integrator scripting language rules and syntax

Variables can be used as file names for:
• Flat file sources and targets
• XML file sources and targets
• XML message targets (executed in the Designer in test mode)
• Document file sources and targets (in an SAP R/3 environment)
• Document message sources and targets (in an SAP R/3 environment)



Local variables and parameters
In Data Integrator, local variables are restricted to the object in which they are created (job or work flow). You must use parameters to pass local variables to child objects (work flows and data flows).

Parameters are expressions that pass to a work flow, data flow or custom function when they are called in a job.

Variables are defined in the Variables and Parameters window.

Naming and defining variables
• Variable names must be preceded by a dollar sign ($). Local variables start with $, while global variables can be denoted by $GV.
• Global variables used in a script or expression must be defined at the job level using the Variables and Parameters window.

To define a local variable
1 Click the name of the job or work flow in the project area.
2 From the Tools menu, select Variables to open the Variables and Parameters window.
3 Go to the Definitions tab.
4 Select Variables.
5 Right-click and select Insert.
6 Right-click the new variable (for example, $NewVariable0) and select Properties.
7 Enter the name of the new variable. The name can include any alphanumeric character or underscores (_), but cannot contain blank spaces. Always begin the name with a dollar sign ($).
8 Select the data type for the variable.
9 Click OK.

Variable values and the Smart Editor
• The return value must be passed outside the function using the following statement:
RETURN(expression)
where expression defines the value to be returned.
• Existing variables and parameters displayed in the Smart Editor are filtered by the context from which the Smart Editor is opened.

Expression and variable substitution
• Square brackets substitute the value of the expression, for example:
Print('The value of the start date is: [sysdate()+5]');
• Curly brackets quote the value of the expression in single quotation marks, for example:
$StartDate = sql('target_ds', 'SELECT ExtractHigh FROM Job_Execution_Status WHERE JobName = {$JobName}');
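A short script can combine both forms of substitution (a sketch reusing the datastore, table, and variable names from the examples above; they are illustrations, not requirements):
$JobName = 'Load_Customers';
$StartDate = sql('target_ds', 'SELECT ExtractHigh FROM Job_Execution_Status WHERE JobName = {$JobName}');
print('Extract high value for [$JobName] is [$StartDate]');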


Explaining differences between global and local variables
In Data Integrator, local variables are restricted to the object in which they are created (job or work flow). You must use parameters to pass local variables to child objects (work flows and data flows).

Global variables are restricted to the job in which they are created; however, they do not require parameters to be passed to work flows and data flows.

It is recommended that you use global variables if you want to use a value throughout a specific job, since local variables are restricted to the object in which they are created.

Global variables can simplify your work. For example, during production you can change values for default global variables at runtime from a job's schedule without having to open a job in the Designer.

You can set values for global variables in script objects. You can also set global variable values using external job, execution, or schedule properties.

The recommended naming convention for global variables is $GV_variablename.

When you use global variables, it is important to use consistent application and naming conventions across jobs, especially when global variables are used inside work flows and data flows that might be reused in other jobs that do not define the same global variables.

For example, suppose you create job A, a global variable A at the job level, and work flow A, and you then add a script in work flow A that calls global variable A. If you use work flow A in a separate job B that does not contain global variable A, your job will not work.
Note: While it is more convenient to use global variables, you should consider, during the design stage, whether you are going to use the work flows and data flows that call global variables in other jobs. Consider using local variables if you are going to reuse a work flow that calls global variables in other jobs.

We will focus on global variables for the remainder of this lesson. For more information on using local variables and passing parameters, see “Variables and Parameters”, Chapter 12 in the Data Integrator Designer Guide.


Understanding global variables

Global variables are global within a job. Using global variables reduces the development time required for passing values between job components, and creates a dependency between the job-level global variable name and the job components that use it.

After completing this unit, you will be able to:
• Create global variables
• View global variables
• Set global variable values

Setting parameters is not necessary when you use global variables. However, once you use a name for a global variable in a job, that name becomes reserved for the job. Global variables are exclusive within the context of the job in which they are created. Global variables can be defined in the Variables and Parameters window.

To create a global variable
1 Click the name of a job in the project area or double-click a job from the Local object library.
2 From the Tools menu, select Variables to open the Variables and Parameters window.
3 Click the Definitions tab.
4 Right-click Global Variables (Job Context: __) and select Insert. $NewJobGlobalVariable appears inside the global variables tree.
5 Right-click $NewJobGlobalVariable, and select Properties. The Global Variable Properties window opens.
6 Rename the variable, and select a data type.
7 Click OK.


The Variables and Parameters window displays the renamed global variable.

Viewing global variables
Global variables, defined in a job, are visible to the objects within that job. A global variable defined in one job is not available for modifying or viewing from another job.

You can view global variables from the Variables and Parameters window (with an open job in the work space) or from the Properties dialog of a selected job.

To view global variables in a job
1 In the Local object library, click the Jobs tab.
2 Right-click the job and select Properties.
3 Click the Global Variables tab. Global variables appear on this tab.

Setting global variables
In addition to setting a variable inside a job using an initialization script, you can set and maintain global variable values outside a job. Values set outside a job are processed the same way as those set in an initialization script. However, if you set a value for the same variable both inside and outside a job, the internal value will override the external job value.

Values for global variables can be set outside a job:
• As a job property
• As an execution or schedule property

Global variables without defined values are also allowed. They are read as NULL.

All values defined as job properties are shown in the Properties window. By setting values outside a job, you can rely on the Properties window for viewing values that have been set for global variables and easily edit values when testing or scheduling a job.
Note: You cannot pass global variables as command line arguments for real-time jobs.


To set a global variable value as a job property
1 Right-click a job in the Local object library or project area.
2 Click Properties.
3 Click the Global Variables tab. All global variables created in the job appear.
4 Enter values for the global variables in the Type and Value columns. You can use any statement used in a script with this option.
5 Click OK. Data Integrator saves the values in the repository as job properties. You can also view and edit these default values in the Execution Properties dialog of the Designer, which allows you to override job property values at run time.


Activity: Creating global variables

Objectives
In this activity you will create a job that requires a global datetime variable to be initialized in a script and referenced in the WHERE clause of a query.
Note: Data Integrator will capture source records having a timestamp that falls within the window of this variable.

Instructions
1 Create a new batch job, and name it Variable_Job.
2 Create a data flow, and name it Variable_DF.
3 Staying at the job level, from the Designer toolbar, click Tools, and select Variables. Notice that the job name is displayed in the Variables and Parameters editor. This is because the variable is created at the job level.


4 Right-click Global Variables, and select Insert.
5 Right-click the newly inserted global variable, rename it to $G_STIME, and define the data type as datetime.
6 Click OK, and close the Variables and Parameters editor.
7 From the tool palette, add a script to the left of Variable_DF.
8 Double-click the script to open it.
9 Click the ellipsis button beside Functions to open the Smart Editor.
10 In the Variables tab of the Smart Editor, expand Global Variables, select $G_STIME, and drag it into the Smart Editor space.
11 Define $G_STIME = '1997.01.01 00:00:00', and click OK.
12 Connect the script to Variable_DF.
13 Open Variable_DF.
14 Add ODS_EMPLOYEE and make it a source.
15 Add a Query transform to the data flow.
16 Connect the query to ODS_EMPLOYEE.
17 Add EMPLOYEE_DIM and make it a target.


Tip: If you want to run your job more than once, set the target table options to either delete rows before updating (if you do not want to keep old rows) or use auto-correct load if you want to keep old rows and update the target when you execute the job again.

18 Connect the Query transform to the EMPLOYEE_DIM table.

Defining the Query and referencing the global variable in the script
1 In Variable_DF, double-click the Query transform and remap these values from the Schema In pane to the Schema Out pane:
Schema In | Schema Out
EMP_ID | EMP_ID
LNAME | LNAME
FNAME | FNAME
REGION | REGION
LAST_UPDATE | TIMESTAMP
2 Set the WHERE clause for the query to:
ODS_EMPLOYEE.LAST_UPDATE >= $G_STIME
3 Validate the query and execute the job.
4 View data for EMPLOYEE_DIM. You should have four rows in your target.


Understanding Data Integrator scripting language

With the Data Integrator scripting language, you can assign values to variables, call functions, and use standard string and mathematical operators. The syntax can be used in an expression as well as in a script.

After completing this unit, you will be able to:
• Explain language syntax
• Use strings and variables in Data Integrator scripting language

Explaining language syntax
Data Integrator now supports ANSI SQL-92 varchar behavior. The following changes are provided to conform to the ANSI SQL-92 varchar behavior:
• Treats an empty string as a zero-length varchar value (instead of NULL)
• Evaluates comparisons to FALSE: when you use the operators Equal (=) and Not Equal (<>) to compare to a NULL constant, the comparison always evaluates to FALSE
• Uses new IS NULL and IS NOT NULL operators in Data Integrator scripting language to test for NULL values
• Treats trailing blanks as regular characters, instead of trimming them, when reading from all sources
• Ignores trailing blanks in comparisons in transforms (Query and Table_Comparison) and functions (decode, ifthenelse, lookup, lookup_ext, lookup_seq)
Note: If you want your jobs and work flows from a previous version of Data Integrator to use the ANSI varchar behavior, you must make changes in these areas:
• NULLs and empty strings in scripts and script functions
• Trailing blanks in sources and functions
• NULLs, empty strings, and trailing blanks in transforms
For more information, please contact Business Objects Customer Support at http://www.techsupport.businessobjects.com/.


Script usage
Typically, a script is executed before data flows for initialization steps and used in conjunction with conditionals to determine execution paths. A script may also be used after work flows or data flows to record execution information such as time, or a change in the number of rows in a data set.

Use a script when you want to calculate values that will be passed on to other parts of the work flow. Use scripts to assign values to variables and execute functions.

Syntax for statements in scripts
Jobs and work flows can use scripts to define detailed steps in the flow of logic. A script can run functions and assign values to variables, which can then be passed to other steps in the flow.

Statements in a script object or custom function must end with a semicolon (;). Comment lines must start with a # character.

Syntax for column and table references in expressions
Expressions are a combination of constants, operators, functions, and variables that evaluate to a value of a given data type. Expressions can be used inside script statements or added to data flow objects. Because expressions can be used inside data flow objects, they often contain column names.

The Data Integrator scripting language recognizes column and table names without special syntax. For example, you can indicate the start_date column as the input to a function in the Mapping tab of a query as:
to_char(start_date, 'dd.mm.yyyy')

The column start_date must be in the input schema of the query.

If there is more than one column with the same name in the input schema of a query, indicate which column is included in an expression by qualifying the column name with the table name. For example, indicate the column start_date in the table status as:
status.start_date

Note that column and table names used as part of SQL strings may require special syntax based on the DBMS that evaluates the SQL. For example, select all rows from the LAST_NAME column of the CUSTOMER table as:
sql('oracle_ds', 'SELECT CUSTOMER.LAST_NAME FROM CUSTOMER')

For scripting in Oracle, see “Strings”, Chapter 7 of the Data Integrator Reference Guide.

Basic syntax rules
• Statements end with a semicolon (;)
• Variables begin with the dollar sign ($)
• String values are enclosed in single quotes (')
• Comments begin with the pound sign (#)


Statements can include:
• Instructions such as BEGIN, END, IF, ELSE, RETURN, WHILE, TRY, and CATCH
• The usual arithmetic and logical operators (see the table below)
• Calls (by name) to built-in or custom functions

Operators
The operators you can use in scripts and expressions are listed in the following table, in order of precedence. Note that when operations are pushed to a DBMS, the precedence is determined by the rules of the DBMS.

Operator Description

+ Addition

- Subtraction

* Multiplication

/ Division

= Assignment, comparison

< Comparison, less than

<= Comparison, less than or equal to

> Comparison, greater than

>= Comparison, greater than or equal to

!= Comparison, not equal to

|| Concatenate

AND Logical AND

OR Logical OR

NOT Logical NOT

IS NULL Comparison, is a NULL value

IS NOT NULL Comparison, is not a NULL value


Script examples
$LANGUAGE = 'E';
$StartDate = '1994.01.01';
$EndDate = '1998.01.31';
$start_time_str = sql('tutorial_ds', 'select to_char(start_time, \'YYYY-MM-DD HH24:MI:SS\')');
$end_time_str = sql('tutorial_ds', 'select to_char(max(last_update), \'YYYY-MM-DD HH24:MI:SS\')');
$start_time = to_date($start_time_str, 'YYYY-MM-DD HH24:MI:SS');
$end_time = to_date($end_time_str, 'YYYY-MM-DD HH24:MI:SS');
$end_time = sql('tutorial_ds', 'select to_char(end_time, \'YYYY-MM-DD HH24:MI:SS\')');
if (($end_time = NULL) or ($end_time = '')) $recovery_needed = 1;
else $recovery_needed = 0;

Using strings and variables in Data Integrator scripting language
Special care must be given to the handling of strings. Quotation marks, escape characters and trailing blanks can all have an adverse effect on your script if used incorrectly. This section provides guidance on how to format your strings.

Quotation Marks
The type of quotation marks to use in strings depends on whether you are using identifiers or constants. An identifier is the name of the object (for example, table, column, data flow, or function). A constant is a fixed value used in computation. There are two types of constants:
• String constants (for example, 'Hello World' or '1995.01.23')
• Numeric constants (for example, 2.14)

Identifiers need quotation marks if they contain special (non-alphanumeric) characters. For example, you need a double quote for "compute large numbers" because it contains blanks. Use single quotes for string constants.

Escape characters
If a constant contains a single quote (') or backslash (\) or another special character used by the Data Integrator scripting language, then those characters need to be escaped. For example, the following characters must be preceded with an escape character to be evaluated properly in a string. Data Integrator uses the backslash (\) as the escape character.

Character | Example
single quote (') | 'World\'s Books'
backslash (\) | 'C:\\temp'
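In a script, constants containing these characters would be written with the escapes in place, for example (a minimal sketch with hypothetical variable names):
$book_title = 'World\'s Books';
$temp_dir = 'C:\\temp';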

Trailing blanks
Trailing blanks are not stripped from strings that are used in scripts or user script functions.

NULLs, empty strings, and trailing blanks in sources, transforms, and functions
To conform to the ANSI VARCHAR standard, Data Integrator:
• Treats an empty string as a zero-length varchar value (instead of NULL)
• Uses new IS NULL and IS NOT NULL operators in Data Integrator scripting language
• Treats trailing blanks as regular characters, instead of trimming them, when reading from all sources
• Ignores trailing blanks in comparisons

NULL values
Indicate NULL values using the keyword NULL. For example, you can check whether a column (COLX) is null or not:

COLX IS NULL

COLX IS NOT NULL

Data Integrator does not check for NULL values in data columns. Use the function nvl to remove NULL values. For more information on the nvl function, see "Functions and Procedures", Chapter 6 in the Data Integrator Reference Guide.

NULL values and empty strings
Data Integrator uses the following two rules with empty strings:
• When you assign an empty string to a variable, Data Integrator treats the value of the variable as a zero-length string. An error results if you assign an empty string to a variable that is not a varchar. To assign a NULL value to a variable of any type, use the NULL constant.
• As a constant (''), Data Integrator treats the empty string as a varchar value of zero length. Use the NULL constant for the null value.

Data Integrator uses the following three rules with NULLs and empty strings in conditionals:



1 The Equals (=) and Not Equal to (<>) comparison operators against a null value always evaluate to FALSE. This FALSE result includes comparing a variable that has a value of NULL against a NULL constant. The following table shows the comparison results for the variable assignments $var1 = NULL and $var2 = NULL:

Condition | Translates to | Returns
If (NULL = NULL) | NULL is equal to NULL | FALSE
If (NULL != NULL) | NULL is not equal to NULL | FALSE
If (NULL = '') | NULL is equal to empty string | FALSE
If (NULL != '') | NULL is not equal to empty string | FALSE
If ('bbb' = NULL) | bbb is equal to NULL | FALSE
If ('bbb' != NULL) | bbb is not equal to NULL | FALSE
If ('bbb' = '') | bbb is equal to empty string | FALSE
If ('bbb' != '') | bbb is not equal to empty string | TRUE
If ($var1 = NULL) | NULL is equal to NULL | FALSE
If ($var1 != NULL) | NULL is not equal to NULL | FALSE
If ($var1 = '') | NULL is equal to empty string | FALSE
If ($var1 != '') | NULL is not equal to empty string | FALSE
If ($var1 = $var2) | NULL is equal to NULL | FALSE
If ($var1 != $var2) | NULL is not equal to NULL | FALSE

The following table shows the comparison results for the variable assignments $var1 = '' and $var2 = '':

Condition | Translates to | Returns
If ($var1 = NULL) | Empty string is equal to NULL | FALSE
If ($var1 != NULL) | Empty string is not equal to NULL | FALSE
If ($var1 = '') | Empty string is equal to empty string | TRUE
If ($var1 != '') | Empty string is not equal to empty string | FALSE
If ($var1 = $var2) | Empty string is equal to empty string | TRUE
If ($var1 != $var2) | Empty string is not equal to empty string | FALSE


2 Use the IS NULL and IS NOT NULL operators to test the presence of null values. For example, assuming a variable is assigned $var1 = NULL;, the comparisons evaluate as follows:

Condition | Translates to | Returns
If ('bbb' IS NULL) | bbb is NULL | FALSE
If ('bbb' IS NOT NULL) | bbb is not NULL | TRUE
If ('' IS NULL) | Empty string is NULL | FALSE
If ('' IS NOT NULL) | Empty string is not NULL | TRUE
If ($var1 IS NULL) | NULL is NULL | TRUE
If ($var1 IS NOT NULL) | NULL is not NULL | FALSE

3 When comparing two variables, always test for NULL. In this scenario, you are not testing a variable with a value of NULL against a NULL constant (as in the first rule). Either test each variable and branch accordingly, or test in the conditional as shown in the second row of the following table.

Condition | Recommendation
if ($var1 = $var2) | Do not compare without explicitly testing for NULLs. Business Objects does not recommend using this logic because any relational comparison to a NULL value returns FALSE.
if ((($var1 IS NULL) AND ($var2 IS NULL)) OR ($var1 = $var2)) | Executes the TRUE branch if both $var1 and $var2 are NULL, or if neither is NULL but they are equal to each other.
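Following the recommendation in the second row, a conditional in a script might be written like this (a small sketch; the print messages are placeholders):
if ((($var1 IS NULL) AND ($var2 IS NULL)) OR ($var1 = $var2))
print('Values match');
else
print('Values do not match');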


Scripting a custom function

You will sometimes need functionality beyond that of the built-in functions provided by Data Integrator. In that case, you can create your own custom functions in the Data Integrator scripting language.

After completing this unit, you will be able to:
• Determine when to use custom functions
• Create a custom function

Determining when to use custom functions
Data Integrator provides a large set of built-in functions, grouped into these categories:
• Aggregate
• Conversion
• Date
• Environment
• Math
• Miscellaneous
• String
• System
• Validation

When a built-in function does not meet the needs of your application, you can create a custom function.

In Data Integrator, database and application functions, custom functions, and most built-in functions can be executed in parallel within the transforms in which they are used.

Custom functions are:
• Written by the user in Data Integrator scripting language
• Reusable objects
• Managed through the function wizard

Like other functions, custom functions can return values through:
• Function invocation
• Output parameters

Variables declared in a function are local to that function.


Creating a custom function
You can create your own functions by writing script functions in Data Integrator scripting language using the Smart Editor. Saved custom functions appear in the function wizard and the Smart Editor under the User_Script category. They are also displayed in the Local object library under the Custom Functions tab. You can edit and delete custom functions from the object library.

Consider these guidelines when you create your own functions:
• Functions can call other functions
• Functions cannot call themselves
• Functions cannot participate in a cycle of recursive calls (for example, function A cannot call function B, which calls function A)
• Functions return a value
• Functions can have parameters for input, output, or both. However, data flows cannot pass parameters of type output or input/output.

Before creating a custom function, you must know the input, output, and return values and their data types. The return value is predefined to be Return.
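For example, a very small custom function that builds a full name from two input parameters might have a body like this (a sketch; the parameters $FirstName and $LastName are hypothetical input parameters of type varchar, and the Return type is varchar):
Return ($FirstName || ' ' || $LastName);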

To create a custom function
1 From the Tools menu, select Custom Functions.
2 In the Custom Function area, right-click and select New.
3 Enter the name of the new function.
4 Enter a description for your function.


5 Click Next to open the Smart Editor. In the Smart Editor, you define the return type, parameter list, and any variables to be used in the function.

6 In the Variables tab, click the plus sign (+) to reveal Parameters.
7 Right-click Return and select Properties.
8 Select a data type from the Data type list, and click OK. By default, the return data type is set to int.

9 In the Variables tab, right-click Parameters and select Insert.
10 Define parameter properties by selecting a Data type and a Parameter type (for example, Input, Output, or Input/Output).
Note: Data Integrator data flows cannot pass variable parameters of type output and input/output.
11 Click OK.

Repeat steps 7 to 11 for each parameter required in your function.
Note: When adding subsequent parameters, the right-click menu will include options to Insert Above or Insert Below. Use these menu commands to create, delete, or edit variables and parameters.

12 Complete the text for your function in the right pane.

13 Click the Validate button to validate your function.


If your function contains syntax errors, Data Integrator displays a list of those errors in an embedded pane below the editor. To see where the error occurs in the text, double-click an error. The Smart Editor redraws to show the location of the error.
Note: Variables and parameters for an existing custom function are local to each function. Therefore, they are not displayed in the Variables and Parameters window (accessible from Tools > Variables). Variables and parameters for custom functions can be viewed in the Smart Editor library, under the Variables tab, when you edit the custom function.


Activity: Creating a custom function

Objectives
In this activity you will write a custom function and test the function by writing the return value to a table.

Instructions
1 Create a new job called Northwind_Functions_Job with a new data flow inside it called Northwind_Functions_DF.
2 From the Designer toolbar, click Tools, and select Custom Functions to create a new custom function called Northwind_Count.
3 Right-click the custom function area to create a new custom function. For the description, type: This is an example of a count custom function.
4 Click Next.
5 In the Smart Editor, click the Variables tab, right-click Local Variables, and click Insert. Call this local variable $Northwind_Count.
6 Leave the data type as int.
7 Drag the variable $Northwind_Count into the Smart Editor.
8 In the Functions tab, expand Database and double-click f(x) sql.
9 Type the remaining command in the Smart Editor. The complete command is:
$Northwind_Count = sql('Northwind', 'select count(*) from Customers');
Return $Northwind_Count;

10 Validate the script, and click OK.


11 In the Northwind_Functions_DF select the Northwind CUSTOMERS table as your source table.

12 Create a template table called Customer_Count.
13 Add a Query transform to Northwind_Functions_DF and connect the CUSTOMERS source table and the Customer_Count template table to the Query transform.

14 In the Query Editor, right-click Query in the Schema Out area to create a new Output column called COUNT. Leave the data type as int.

15 In the Mapping tab of the Query Editor, add the Northwind_Count function from the Custom Functions category.

16 To ensure that we do not have duplicate values, select the Distinct Row option in the Select tab of the Query Editor.

17 Validate and save the job.
18 Add Northwind_Functions_Job to the Exercises project in the project area and execute it.
19 After the job is done executing, click the statistics icon in the log file. The Row Count displayed should be 91 for the CUSTOMERS source table and 1 for the Customer_Count target template table.

20 View data in the target table. The result displayed should be 91.


Lesson Summary

Quiz: Using Data Integrator Scripting Language and Variables
1 Explain the differences between a variable and a parameter.

2 When would you use a global variable instead of a local variable?

3 What is the recommended naming convention for variables in Data Integrator?

After completing this lesson, you are now able to:
• Understand variables
• Describe the variables and parameters window
• Explain differences between global and local variables
• Create global variables
• View global variables
• Set global variable values
• Explain language syntax
• Use strings and variables in Data Integrator scripting language
• Determine when to use custom functions
• Create a custom function


Workshop

Evaluating and validating data

Duration: 45 minutes to 1 hour

Objective
In this activity you will use multiple transforms and techniques to define validation rules on your data. You will apply what you have learned in Lessons 5 to 8 to:
• Create a data flow that uses more than one source table
• Use more than one transform in a data flow to separate and validate data into target template tables
• Use the match_pattern function to compare input strings

Scenario
As an ETL engineer, you collaborate with others to identify business processes and the data supporting these processes. You determine that you need to use a set of global supplier data in order to support one of the business processes identified.

You know that data quality is an integral component of any ETL job and you must make sure that the supplier data used is clean and reliable. You also know that Data Integrator gives you the ability to use various runtime transformations to evaluate every segment of incoming data and compare the data against a set of pre-defined business rules.

After gathering the requirements set about by the pre-defined business rules, you then use Data Integrator to evaluate and validate the supplier data.

Instructions
1 Use these naming conventions to create your job and data flow:
• Validate_Suppliers_Job
• Validate_Suppliers_DF
2 Define your data flow using the Northwind Suppliers table as your source.
3 Separate the suppliers into US_SUPPLIERS and OTHER_SUPPLIERS target template tables.
Tip: You can use one of the Data Integrator transforms to simplify branch logic in data flows by consolidating decision-making logic into one transform.
4 After you separate the records, validate US_SUPPLIERS by adding the Validation transform to your data flow.
Note: Think about where you may want to load your failed records. Do you have all the objects necessary to complete the data flow design before you validate US_SUPPLIERS?


5 Your data flow design should have:
• The SUPPLIERS source table
• A transform used for separating US_SUPPLIERS from OTHER_SUPPLIERS
• A Validation transform used for the validation rules provided
• Target tables to load all your data into
A graphical representation is also available at the end of this activity, but challenge yourself first! See if you can figure this out!

6 Validate US_SUPPLIERS using these validation rules:

Validation Rule | Affected Column | Action on Failure (if this rule fails you want to:)
You want to pass US regions in LA, MI, and OR | REGION | Send the rows to the Failed table.
You do not want any of your Fax numbers to be Null. For those that are Null, replace them with an unknown description. | FAX | Send these rows to both tables, but for the Pass table substitute the value with 'Unknown'.
You want the supplier ID to be between 1 and 18 | SUPPLIERID | Send these rows to the Pass table.
You want to match the pattern of the postal code to be 5 digits, for example 90311 (see the match_pattern information below). | POSTALCODE | Send these rows to the Failed table.

The match_pattern function allows you to compare input strings to simple patterns supported by Data Integrator. This function looks for a matching whole string, not substrings.
Return value: Integer. Returns 1 for a match, otherwise 0.
Syntax: match_pattern(input_string, pattern_string)
where input_string represents the data to be searched and pattern_string represents the pattern you want to find in the input string.
Use these characters to create a pattern:
X | Represents uppercase characters
x | Represents lowercase characters
9 | Represents numbers
\ | Escape character
* | Any characters occurring zero or more times
? | Any single character occurring once and only once
[ ] | Any one character inside the braces occurring once
[!] | Any character except those after the exclamation point (for example, [!BOBJ] allows anything except BOBJ)

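For example, the postal code rule can be expressed as a custom validation condition using match_pattern (a sketch; the exact way you enter the condition in the Validation transform may differ):
match_pattern(POSTALCODE, '99999') = 1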

7 Validate your data flow and execute the job.
Note: Data conversion warning messages are OK. Your job should execute successfully without any errors. Your data flow should look similar to this:

Questions
1 How many rows are loaded into the US_SUPPLIERS table?
2 How many rows are loaded into the OTHER_SUPPLIERS table?
3 What happened to the Fax numbers with Null values that you passed with the validation rule earlier?
4 How do you check for invalid rows?
5 Can you tell why the rows failed? What happens to rejected data?

Solution
1 You should have three rows in the US_SUPPLIERS table.
2 You should have 25 rows in the OTHER_SUPPLIERS table.
3 View data in the US_SUPPLIERS table. You should see that for those fax numbers that had null values, Null has been replaced with Unknown.
4 View data for the INVALID_US_SUPPLIERS table. You should see three failed rows. Note that one failed row is for Supplier ID 19. Scroll to the right and you will see that two additional columns have been inserted.
5 When you use the Validation transform, two columns are added to the Fail output schema: DI_ERRORACTION and DI_ERRORCOLUMNS. The DI_ERRORACTION column indicates where failed data is sent. This is done with column labels: B for rows that are sent to both the Pass and Fail outputs, and F for rows that are sent to the Fail output only.



The DI_ERRORCOLUMNS column displays all the error messages for columns with failed rules. Note that Supplier ID 19 failed on two validation rules that you set earlier, on REGION and SUPPLIERID. This is because Data Integrator tracks, and gives you the ability to view, more than one failed rule on a single row at the same time.
Data collected in rejected/invalid tables is available with details on which validation rules rejected the data. If there is a need, you can always come back to look at the rejected/invalid data, determine why it failed, assess the quality of the data, and then re-evaluate it.

A solution file, Evaluating_and_validating_data_solution.atl, is also included in your resource CD. Import this .atl file into the Designer to view the actual job, data flow and transform definitions.

To import this .atl solution file, right-click in your object library, select Repository, and click Import from File. Browse to the resource CD to locate the file and open it.
Note: Do not execute the solution job, as this may overwrite the results in your target table. Use the .atl solution file as a reference to view your data flow design and mapping logic.


Lesson 9
Capturing Changes in Data

One of the most common methods for updating large amounts of data regularly in a data warehouse with minimal down time is Changed-Data Capture (CDC). When updating data you can choose to do a full refresh of your data or you can choose to extract only new or modified data and update the target system. This lesson focuses on capturing incremental data changes.

Data Integrator now supports enabling log-based capture of changed data on SQL Server databases, to be applied to a target system. This CDC feature interacts with SQL Replication Server.

This lesson focuses on using time-stamps because log-based data is more complex and is outside the scope of this course. For more information on using CDC with Microsoft SQL databases, see “Techniques for Capturing Data”, Chapter 19 in the Data Integrator Designer Guide.

In this lesson you will learn about:
• Using source-based Changed Data Capture
• Using target-based CDC
• Using history preserving transforms

Duration: 2 hours


Using source-based Changed Data Capture

Changed Data Capture (CDC) allows you to extract only new or modified data and update the target system after the initial load of a job completes. Data Integrator acts as a mechanism to locate and extract only the incremental data that changed since the last refresh.

After completing this unit, you will be able to:
• Explain what Changed Data Capture (CDC) is
• Use Changed Data Capture (CDC) with time-stamped sources
• Create an initial-load job
• Create a delta-load job

Explaining what Changed Data Capture (CDC) is
There are two general incremental CDC methods: source-based and target-based CDC. Improving performance and preserving history are the most important reasons for using Changed Data Capture.

Using CDC:
• Reduces the amount of data that needs to be loaded in the delta load: performance improves because the job takes less time to process with less data to extract, transform, and load.
• Preserves history: change history can be tracked by the target system so that data can be correctly analyzed over time; CDC can provide a record of these changes. For example, if a customer moves from one sales region to another, simply updating the customer record to reflect the new region negatively affects any analysis by region over time, because the purchases made by that customer before the move are attributed to the new region.

Note: CDC is recommended for large tables. If the tables that you are working with are small, you may want to consider reloading the entire table instead.

Source-based CDC, sometimes also referred to as incremental extraction, extracts only the changed rows from the source. This method is preferred for the performance gain of extracting the least rows.

To use source-based CDC, your source data must have some indication of the change such as timestamps and change logs.


Your database tables must have at least an update timestamp and preferably also a create timestamp.
• Timestamps:
  • Indicate either when a row was created or when it was updated
• Change logs:
  • Are captured by the DBMS
  • Must provide a way to correlate change log records to data in the tables

If the table is large and there is too much changed information, it may be better to simply reapply the whole table.

It is recommended to use timestamps to capture changes in source data and use change logs as an audit trail since most change logs are usually quite large.

You can also use source-based CDC with Oracle 9, and IBM DB2 8.2. For more information, see “Techniques for Capturing Changed Data”, Chapter 19 in the Data Integrator Designer Guide and the Data Integrator XI Release Summary.

Using Changed Data Capture (CDC) with time-stamped sources
In these sections, we will use the term timestamp to refer to date, time, or datetime values. The discussion in this section applies to cases where the source table has either CREATE or UPDATE timestamps for each row.

Use timestamp-based CDC to track changes if:
• You are using sources other than Oracle 9i, IBM DB2, or mainframes accessed with Attunity Connect or IBM DB2 Information Integrator Classic Federation for z/OS.
• There are date and time fields in the tables being updated.
• You are updating a large table that has a small percentage of changes between extracts and an index on the date and time fields.
• You are not concerned about capturing intermediate results of each transaction between extracts (for example, if a customer changes regions twice in the same day).

It is not recommended that you use timestamp-based CDC when:
• You have a large table, a large percentage of it changes between extracts, and there is no index on the timestamps.
• You need to capture physical row deletes.
• You need to capture multiple events occurring on the same row between extracts.

Timestamps can indicate whether a row was created or updated. Some tables have both create and update timestamps; some tables have just one. This section assumes that tables contain at least an update timestamp. For other situations, see the “Types of timestamp” section, Chapter 19 in the Data Integrator Designer Guide.

Some systems have timestamps with dates and times, some with just the dates, and some with monotonically generated increasing numbers. You can treat dates and generated numbers the same. It is important to note that for the timestamps based on real time, time zones can become important. If you keep track of timestamps using the nomenclature of the source system (that is, using the source time or source-generated number), you can treat both temporal (specific time) and logical (time relative to another time or event) timestamps the same way.

The basic technique for using timestamps is to determine changes and to save the highest timestamp loaded in a given job and start the next job with that timestamp.

To do this, you need to create a status table that tracks the timestamps of rows loaded in a job. At the end of a job, UPDATE this table with the latest loaded timestamp. The next job then reads the timestamp from the status table and selects only the rows in the source for which the timestamp is later than the status table timestamp.

This example illustrates the technique. Assume that the last load occurred at 2:00 PM on January 1, 1998. At that time, the source table had only one row (key=1) with a timestamp earlier than the previous load. Data Integrator loads this row into the target table and updates the status table with the highest timestamp loaded: 1:10 PM on January 1, 1998. After 2:00 PM Data Integrator adds more rows to the source table:

Source table
Key | Data | Update_Timestamp
1 | Alvarez | 01/01/98 01:10 PM
2 | Tanaka | 01/01/98 02:12 PM
3 | Lani | 01/01/98 02:39 PM

Target table
Key | Data | Update_Timestamp
1 | Alvarez | 01/01/98 01:10 PM

Status table
Last_Timestamp
01/01/98 01:10 PM

At 3:00 PM on January 1, 1998, the job runs again. This time the job:
1 Reads the Last_Timestamp field from the status table (01/01/98 01:10 PM).
2 Selects rows from the source table whose timestamps are later than the value of Last_Timestamp. The SQL command to select these rows is:
SELECT * FROM Source WHERE Update_Timestamp > '01/01/98 01:10 pm'
This operation returns the second and third rows (key=2 and key=3).
3 Loads these new rows into the target table.
4 Updates the status table with the latest timestamp in the target table (01/01/98 02:39 PM) with the following SQL statement:
UPDATE STATUS SET Last_Timestamp = SELECT MAX('Update_Timestamp') FROM target_table


The target shows the new data:

Source table
Key | Data | Update_Timestamp
1 | Alvarez | 01/01/98 01:10 PM
2 | Tanaka | 01/01/98 02:12 PM
3 | Lani | 01/01/98 02:39 PM

Target table
Key | Data | Update_Timestamp
1 | Alvarez | 01/01/98 01:10 PM
2 | Tanaka | 01/01/98 02:12 PM
3 | Lani | 01/01/98 02:39 PM

Status table
Last_Timestamp
01/01/98 02:39 PM

CDC timestamped sources require that you first create a job to execute the work flow. This work flow must:
• Read the status table
• Set the value of a variable to the last timestamp
• Call the data flow with the variable passed to it as a parameter
• Update the status table with the new timestamp
Note: Timestamps are set in scripts before and after the data flow.

Work flow: Changed data with timestamps (GetLastStamp script, Update_Target data flow, SetNewStamp script)
GetLastStamp:
$Last_Timestamp_var = sql('target_ds', 'SELECT to_char(last_timestamp, \'YYYY.MM.DD HH24:MI:SS\') FROM status_table');
SetNewStamp:
$Last_Timestamp_var = sql('target_ds', 'UPDATE status_table SET last_timestamp = (SELECT MAX(target_table.update_timestamp) FROM target_table)');


Next, you need to specify a data flow containing a source table, a query and a target table (assuming the required metadata for the source and target tables has been imported).

You can execute the initial-load job and view the results to verify that the job returned all the rows from the source. After the initial-load is executed, you can run the delta-load job to extract only the changes and update the target table.

Keep in mind that unless source data is rigorously isolated during the extraction process, overlaps may occur when there is a window of time in which changes can be lost between two extraction runs. These are more advanced concepts that go beyond the scope of this course, but if you want more information, see "Overlaps" in "Techniques for Capturing Changed Data", Chapter 19 in the Data Integrator Designer Guide.

Initial and delta loads
Data Integrator handles initial and delta loads this way:
• Initial load: The first time you execute the batch job, the Data Integrator Designer uses the initial load to create the data tables and retrieve data from the first row to row N, where N is the number of rows defined to be executed in the data flow at once. By default, the initial load runs the first time you execute the job.
• Delta load: After creating tables with an initial load, the delta work flow incrementally loads the next number of rows specified in the data flow. The delta load retrieves only data that has been changed or added since the last load iteration. When you execute your job, the delta load may run several times, loading data from the specified number of rows each time until all new data has been written to the target database.

Data flow: Changed data with timestamps (SOURCE, Query, TARGET)
The query selects rows from SOURCE_TABLE to load into TARGET_TABLE. The query includes a WHERE clause to filter out rows with older timestamps.
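For example, with the global variables defined in the procedures that follow, such a WHERE clause might look like this (a sketch; SOURCE_TABLE and Update_Timestamp stand in for your own table and column names):
(SOURCE_TABLE.Update_Timestamp >= $GV_LastTimeStamp) and (SOURCE_TABLE.Update_Timestamp <= $GV_NewTimeStamp)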


Creating an initial load job

To create an initial load job
For the procedures below, make sure you give the job, work flow and data flow distinctive names that will help you identify them. For example, you may want to name your job CDC_Initial_Job.
1 Create a job.
2 Add the global variables $GV_LastTimeStamp and $GV_NewTimeStamp with a datetime datatype to your job.
3 Add a work flow to your job and name it.
4 Double-click the work flow and add a script to your work flow to get the Last Timestamp.
5 Add a data flow to your work flow. Name your data flow.
6 Add a script to your work flow to set the New Timestamp.
7 Connect the two scripts and the data flow together.

Your work flow definition should look similar to this:

8 Double-click each script to define the scripts in your work flow using the scripts below as reference:
• To get the Last Timestamp:
$GV_LastTimeStamp = 'YYYY.MM.DD HH24:MI:SS';
$GV_NewTimeStamp = sysdate();
• To set the New Timestamp:
sql('target_ds', 'DELETE FROM TARGET.Timestamp_Fieldname');
sql('target_ds', 'INSERT INTO TARGET.Timestamp_Fieldname VALUES ({$GV_NewTimeStamp})');
where Timestamp_Fieldname represents the timestamp table you created in your database.
Note: These scripts are for use with SQL Server only.
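Put together, and using the CDC_TIME status table from the activities later in this lesson rather than the generic Timestamp_Fieldname placeholder, the two initial-load scripts might look similar to this sketch (the table, column, and datastore names are the ones used in the activities; substitute your own):

# Script to get the Last Timestamp: the initial load starts from a fixed date and ends now
$GV_LastTimeStamp = '1996.01.01 12:00:00';
$GV_NewTimeStamp = sysdate();

# Script to set the New Timestamp: record the end time of this run in the status table
sql('target_ds', 'DELETE FROM CDC_TIME');
sql('target_ds', 'INSERT INTO CDC_TIME VALUES ({$GV_NewTimeStamp})');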



9 On the toolbar, click the Validate icon to validate the script.

10 Double-click the data flow to open it.

11 In the Designer workspace, add a source table, the Query transform, and a target table to define the data flow.

12 Connect the source and target tables with the Query transform.
13 Double-click the Query transform.
14 In the Query Editor, drag the corresponding input schema columns to the output schema columns to determine your output schema.
15 In the Query Editor, click the Where tab, and type your script using this as reference:
(TABLENAME.LASTUPDATE >= $GETLASTSTAMP) and
(TABLENAME.LASTUPDATE <= $SETNEWSTAMP)
16 Validate your current data flow.
17 In the Designer workspace, click the job tab.

18 On the toolbar, click the Validate icon to validate the job.

19 Save your initial load job.

Adding the delta-load job is easier when you replicate and modify the initial-load data flow you just created.

To add the delta-load data flow
For the procedure below, make sure you give the data flow a distinctive name that will help you identify it. For example, you may want to name your data flow CDC_Delta_DF.
1 In the Local object library, right-click the data flow that you created for the initial-load job, and select Replicate.
2 Right-click the data flow you just replicated, and select Rename to rename the data flow.
3 Open the data flow in the Designer workspace.
4 Double-click the target table.
5 In the Target Table Editor, click the Options tab.
6 Clear the Delete data from table before loading check box.


Creating a delta load job

To create a delta load job
For the procedures below, make sure you give the job and work flow distinctive names that will help you identify them. For example, you may want to name your job CDC_Delta_Job.
1 Create a job.
2 Add the global variables $GV_LastTimeStamp and $GV_NewTimeStamp with the datetime datatype to your job.
3 Add a work flow to your job.
4 Add a script to your work flow to get the Last Timestamp.
5 Add the delta-load data flow.
6 Add a script to your work flow to set the New Timestamp.
7 Connect the two scripts and the data flow together.
8 Double-click each script to define the scripts in your work flow using the scripts below as reference:
• To get the Last Timestamp:
$GV_LastTimeStamp = to_date(sql('target_ds', 'SELECT LAST_TIME FROM TARGET.Timestamp_Fieldname'), 'YYYY-MM-DD HH24:MI:SS');
$GV_NewTimeStamp = sysdate();
Note: For the delta-load job, you define the start-time global variable to be the last timestamp recorded in Timestamp_Fieldname.
• To set the New Timestamp:
sql('target_ds', 'UPDATE TARGET.Timestamp_Fieldname SET LAST_TIME = {$GV_NewTimeStamp}');
where Timestamp_Fieldname represents the timestamp table you created in your database.
Note: These scripts are for use with SQL Server only.
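Using the CDC_TIME status table from the activities in this lesson instead of the generic placeholder, the two delta-load scripts might look similar to this sketch (names are those used in the activities; substitute your own):

# Get the Last Timestamp: start from the end time saved by the previous run
$GV_LastTimeStamp = to_date(sql('target_ds', 'SELECT LAST_TIME FROM CDC_TIME'), 'YYYY.MM.DD HH24:MI:SS');
$GV_NewTimeStamp = sysdate();

# Set the New Timestamp: save the end time of this run for the next delta load
sql('target_ds', 'UPDATE CDC_TIME SET LAST_TIME = {$GV_NewTimeStamp}');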

9 In the Designer workspace, click the Job tab.

10 On the toolbar, click the Validate icon to validate the job.

11 Save your delta-load job.
Note: If your system does not use timestamps, you can use a change log to capture a full audit trail or use real time CDC. For more information, see “Real-time Jobs”, Chapter 10 in the Data Integrator Designer Guide.



Overlaps

Unless source data is rigorously isolated during the extraction process (which typically is not practical), there is a window of time when changes can be lost between two extraction runs. This overlap period affects source-based CDC because this kind of data capture relies on a static timestamp to determine changed data.

For example, suppose a table has 1000 rows ordered 1 to 1000. The job starts with timestamp 3:00 and extracts each row. While the job is executing, another process updates two rows, row 1 and row 1000, with timestamps 3:01 and 3:02, respectively. The job is extracting row 200 when row 1 is updated, and row 300 when row 1000 is updated. When it completes, the job records the latest timestamp, 3:02, from row 1000 but misses the 3:01 update to row 1.

There are three techniques for handling this situation:
• Overlap avoidance
• Overlap reconciliation
• Presampling

These techniques are discussed briefly below but are outside the scope of this course. For more information see “Source-based and target-based CDC” in “Techniques for Capturing Changed Data” Chapter 19 in the Data Integrator Designer Guide.

Overlap avoidance

In some cases, it is possible to set up a system where there is no possibility of an overlap. You can avoid overlaps if there is a processing interval where no updates are occurring on the target system.

For example, if you can guarantee the data extraction from the source system does not last more than one hour, you can run a job at 1:00 AM every night that selects only the data updated the previous day until midnight. While this regular job does not give you up-to-the-minute updates, it guarantees that you never have an overlap and greatly simplifies timestamp management.
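For that nightly schedule, the WHERE clause of the extraction query might be sketched like this, assuming a LAST_UPDATE column on the source table and two variables set by a preceding script to the previous and the most recent midnight (all names here are illustrative):

SOURCE_TABLE.LAST_UPDATE >= $G_PrevMidnight and
SOURCE_TABLE.LAST_UPDATE < $G_Midnight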

Overlap reconciliation

Overlap reconciliation requires a special extraction process that reapplies changes that could have occurred during the overlap period. This extraction can be executed separately from the regular extraction. For example, if the highest timestamp loaded from the previous job was 01/01/98 10:30 PM and the overlap period is one hour, overlap reconciliation reapplies the data updated between 9:30 PM and 10:30 PM on January 1, 1998.

The overlap period is usually equal to the maximum possible extraction time. If it can take up to N hours to extract the data from the source system, an overlap period of N (or N plus some small increment) hours is recommended. For example, if it takes at most two hours to run the job, an overlap period of at least two hours is recommended.
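One way to sketch this in a script is to widen the extraction window by the overlap period when the start-time variable is set, for example by pushing the stored timestamp back one hour in the status-table query. The DATEADD expression shown is SQL Server syntax, and the table, column, and variable names are illustrative:

# start the extraction one hour before the last recorded timestamp
$GV_LastTimeStamp = sql('target_ds', 'SELECT DATEADD(hour, -1, MAX(last_timestamp)) FROM status_table');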


Presampling

Presampling eliminates the overlap by first identifying the most recent timestamp in the system, saving it, and then extracting rows up to that timestamp.

The technique is an extension of the CDC timestamp processing technique described previously. The main difference is that the status table contains both a start and an end timestamp once you use the CDC timestamp process to run the initial-load and delta-load jobs. The start timestamp is the latest timestamp extracted by the previous job and the end timestamp is the timestamp selected by the current job.
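A presampling script might be sketched like this: the end timestamp is sampled from the source and saved before extraction begins, and the data flow then selects only rows between the two saved timestamps (the table, column, and variable names are illustrative):

# sample and save the end timestamp before extracting any rows
$GV_EndTime = sql('source_ds', 'SELECT MAX(last_update) FROM source_table');
sql('target_ds', 'UPDATE status_table SET end_timestamp = {$GV_EndTime}');
# the data flow WHERE clause then uses:
# source_table.last_update > $GV_StartTime and source_table.last_update <= $GV_EndTime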


Activity: Using scripts and timestamps for CDC

Objective
In this activity you will:
• Create variables
• Set scripts to specify the initial load in a job

Instructions
1 In the object library, right-click Variable_Job, and select Replicate.
2 Rename Copy_1_Variable_Job to CDC_Job.
3 Repeat these procedures to replicate Validate_DF, and rename it to CDC_DF.
You want to replicate Variable_Job and Variable_DF so that you can carry over the $G_STIME global variable and Where clause you defined earlier.

4 Open CDC_Job in the workspace.
Notice that Variable_DF is still contained within this job.

5 Right-click Variable_DF, and delete it.
6 Add CDC_DF to the workspace.
7 Staying at the job level, create another global variable called $G_ETIME with a datetime data type.
8 Open the script beside CDC_DF.
9 Use the Smart Editor and set $G_STIME and $G_ETIME to cover the period from 1998.01.01 00:00:00 to 1999.12.31 00:00:00:
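The script might look similar to this sketch (string literals assigned to the two datetime variables are converted implicitly):

$G_STIME = '1998.01.01 00:00:00';
$G_ETIME = '1999.12.31 00:00:00';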

10 Go back to the CDC_Job level and connect the script to CDC_DF.



11 Open CDC_DF and in the Query transform, modify the Where clause to be:
ODS_EMPLOYEE.LAST_UPDATE >= $G_STIME and
ODS_EMPLOYEE.LAST_UPDATE <= $G_ETIME

12 Validate and execute CDC_Job.
You should have four rows in your target:


Activity: Creating an initial load

Objective
In this activity you will create the first job that initially loads all of the rows from a source table. This initial job contains:
• An initialization script that sets values for two global variables: $GV_STARTTIME and $GV_ENDTIME
• A data flow that loads only the rows with dates that fall between $GV_STARTTIME and $GV_ENDTIME
• A termination script that updates the CDC_TIME target table that stores the last $GV_ENDTIME

Instructions
1 Create a new job called CDC_Initial_Job.
2 From the Designer menu bar, click Tools, and select Variables.
3 Right-click Global Variables, and select Insert.
4 Right-click the newly inserted global variable, select Properties, and rename it to $GV_STARTTIME. Define the data type as datetime.
5 Repeat these procedures to create another global datetime variable named $GV_ENDTIME.
6 Add a work flow to your job and name it CDC_Initial_WF.
7 Open the work flow, add a script, and name it SET_START_END_TIME.
8 Define this script with this SQL:
$GV_STARTTIME = '1996.01.01 12:00:00';
$GV_ENDTIME = sysdate();

Tip: You can copy and paste the script from the Resource CD. From the Resource CD, browse to the Activity_Source folder, and open Lesson_9_scripts.txt.

9 Validate the script.
10 Add a data flow called CDC_Initial_DF to the right of the SET_START_END_TIME script.
11 Add another script to the right of CDC_Initial_DF and call it UPDATE_CDC_TIME_TABLE.
12 Define this script with this SQL:
sql('target_ds', 'DELETE FROM CDC_TIME');
sql('target_ds', 'INSERT INTO CDC_TIME VALUES ({$GV_ENDTIME})');

Tip: You can copy and paste the script from the Resource CD. From the Resource CD, browse to the Activity_Source folder, and open Lesson_9_scripts.txt.

13 Validate the script.
Note: For this exercise, datetime conversion warning messages are ok.



14 Double-click CDC_Initial_DF and add the ODS_EMPLOYEE table as a source, a Query transform, and a target template table called EMP_CDC.

15 Connect the source, Query and target.
16 Double-click the Query transform and map these columns:
• EMP_ID
• LNAME
• FNAME
• REGION
• LAST_UPDATE
17 In the Query transform editor, click the Where tab, and from the Schema In pane, drag LAST_UPDATE into the editor to define this WHERE clause:
(ODS_EMPLOYEE.LAST_UPDATE >= $GV_STARTTIME) AND
(ODS_EMPLOYEE.LAST_UPDATE <= $GV_ENDTIME)

18 Go back to the data flow level.
19 Double-click EMP_CDC, click the Options tab, and select Delete data from table before loading.
20 At the work flow level, connect the scripts and CDC_Initial_DF.
21 Validate and execute the job.

You should have 4 records in EMP_CDC.


Activity: Creating a delta load

Objective
In this activity you will:
• Make a change to the source data
• Modify the initial job to identify only the rows that have been added or changed in the source

Instructions

Modify the source data
1 In MS SQL Server Enterprise Manager, expand Databases.
2 Expand ODS, and double-click Tables.
3 Right-click ODS_EMPLOYEE, point to Open Table, and select Return all rows.
4 Add yourself to the employee table with EMP_ID = 5, cell 5, and today's date.
5 Save the changes.

Create the delta job
1 In the Designer, right-click CDC_Initial_Job, and select Replicate.
2 Rename Copy_1_CDC_Initial_Job to CDC_Delta_Job.
3 Double-click CDC_Delta_Job to open it in the workspace.
4 From the Designer menu bar, click Tools, and select Variables.
You should see that $GV_STARTTIME and $GV_ENDTIME are available for use in this job.
5 Repeat these procedures to replicate CDC_Initial_WF, and rename it to CDC_Delta_WF.
6 Repeat these procedures to replicate CDC_Initial_DF, and rename it to CDC_Delta_DF.
7 At the CDC_Delta_Job level, delete CDC_Initial_WF.
8 Add CDC_Delta_WF to the job workspace.
9 Double-click CDC_Delta_WF, and modify the SET_START_END_TIME and UPDATE_CDC_TIME_TABLE script content with the following:
• For SET_START_END_TIME, define this script with:
$GV_STARTTIME = to_date(sql('target_ds', 'SELECT LAST_TIME FROM CDC_TIME'), 'YYYY.MM.DD HH24:MI:SS');
$GV_ENDTIME = sysdate();
• For UPDATE_CDC_TIME_TABLE, define this script with:
sql('target_ds', 'UPDATE CDC_TIME SET LAST_TIME = {$GV_ENDTIME}');



Tip: You can copy and paste the script from the Resource CD. From the Resource CD, browse to the Activity_Source folder, and open Lesson_9_scripts.txt.

10 Validate both scripts.
Note: For this exercise, datetime conversion warning messages are ok.

11 Go back to the work flow level, and delete CDC_Initial_DF.
12 Add CDC_Delta_DF to the work flow level workspace.
13 Connect the scripts and CDC_Delta_DF.
14 Double-click CDC_Delta_DF to open it.
15 In the EMP_CDC target table editor, click the Options tab, and clear the Delete data from table before loading option.
16 From the EMP_CDC target table editor, under Update control, select the Auto correct load option.
17 Go back to the job level.
18 Validate and execute CDC_Delta_Job.
19 In CDC_Delta_DF, double-click EMP_CDC.

You should see yourself added as the fifth row in the target table, as shown:

20 Browse to Target_DS and view data for CDC_TIME. You will note that this has been updated with the date for the last entry in the source ODS_EMPLOYEE table.

Modify the source data
1 In MS SQL Query Analyzer, expand ODS.
2 Expand User Tables, right-click ODS_EMPLOYEE, point to Script Object to New Window As, and select Delete.
3 In the Query window, modify the DELETE statement to display:
DELETE FROM [ODS].[DBO].[ODS_EMPLOYEE] WHERE EMP_ID = 5
4 From the menu bar, click Query, and select Execute.
5 In the Designer, run CDC_Delta_Job again and you should see that your record is deleted from the target.
6 Browse to Target_DS and view data for CDC_TIME. You will note that this has been updated with the date for the last entry in the source ODS_EMPLOYEE table.


Using target-based CDC

Introduction

Source-based CDC is almost always preferable to target-based CDC for performance reasons. However, some source systems do not provide enough information to make use of the source-based CDC techniques.

Target-based CDC extracts all the data from the source, but loads only the changed rows into the target. This is useful when your source-based change information is limited. You can compare tables and use the Table_Comparison transform in Data Integrator to support this method.

After completing this unit, you will be able to:
• Use table comparison to support target CDC

Table comparison

The Table_Comparison transform examines all source rows and performs these operations:
• Generates an INSERT for any new row not in the target table
• Generates an UPDATE for any row in the target table that has changed
• Ignores any row that is in the target table and has not changed
• Fills in the generated key for the updated rows

Here is an example:

After the Table_Comparison transform runs, you can include the Key_Generation transform to assign a new key for every INSERT in the table comparison results. This is the data set that Data Integrator loads into the target table.

To accomplish the operation described above, your data flow must include a:
• Source to extract the rows from the source table(s)
• Query to map columns from the source



• Table-comparison transform to generate INSERT and UPDATE rows and to fill in existing keys
• Key-generation transform to generate new keys
• Target to load the rows into the customer dimension table.

Data flow: Load only updated or new rows
SOURCE > Query > Table Comparison > Key Generation > CUSTOMER


Using history preserving transforms

Introduction

History preserving allows the data warehouse or data mart to maintain the history of data so you can analyze it over time. Most likely, you will perform history preservation on dimension tables.

For example, if a customer moves from one sales region to another, simply updating the customer record to reflect the new region would give you misleading results in an analysis by region over time because all purchases made by a customer before the move would incorrectly be attributed to the new region.

After completing this unit, you will be able to:
• Explain what history preservation is
• Identify history preserving transforms

Explaining history preservation

Data Integrator provides a special transform that preserves data history to prevent this kind of situation. The History_Preserving transform ignores everything but rows flagged as UPDATE. For these rows, it compares the values of specified columns and, if the values have changed, flags the row as INSERT. This produces a second row in the target instead of overwriting the first row.

To expand on how Data Integrator would handle the example of the customer who moves between regions:
• If Region is a column marked for comparison, the History_Preserving transform generates a new row for that customer.
• A Key_Generation transform gives the new row a new generated key and loads the row into the customer dimension table.
• The original row describing the customer remains in the customer dimension table with a unique generated key.

In the following example, one customer moved from the East region to the West region, and another customer’s phone number changed.



In this example, the data flow preserves the history for the Region column but does not preserve history for the Phone column.

You can preserve history by creating a data flow that contains a:
• Source to extract the rows from the source table(s)
• Query to map columns from the source
• Table-comparison transform to generate INSERTs and UPDATEs and to fill in existing keys
• History_Preserving transform to convert certain UPDATE rows to INSERT rows
• Key-generation transform to generate new keys for the updated rows that are now flagged as INSERT
• Target to load the rows into the customer dimension table
The resulting dimension table is shown in the target customer table below.

Because the Region column was set as a Compare column in the History_Preserving transform, the change in the Jane's Donuts row created a new row in the customer dimension table.

Since the Phone column was not used in the comparison, the change in the Sandy's Candy row did not create a new row but updated the existing one.

Now that there are two rows for Jane's Donuts, correlations between the dimension table and the fact table must use the highest key value.
Note: Updates to non-history preserving columns update all versions of the row if the update is performed on the natural key (for example, Customer), but only update the latest version if the update is on the generated key (for example, GKey). You can control which key to use for updating by appropriately configuring the loading options in the Target Editor. For more information see “Preserving History”, Chapter 19 in the Data Integrator Designer Guide.

Generated keys

For performance reasons, many data warehouse dimension tables use generated keys to join with the fact table.

For example, customer ABC has a generated key 123 in the customer dimension table. All facts for customer ABC have 123 as the customer key. Even if the customer dimension is small, you cannot simply reload it every time a record changes. Unless you assign the generated key of 123 to the customer ABC, the customer dimension table and the fact tables do not correlate.

You can preserve generated keys by using the lookup function and comparing tables in your data flow. For more information on using these, see “Preserving Generated Keys”, Chapter 19 in the Data Integrator Designer Guide.

Target customer table:

GKey  Data           Region   Phone
1     Fred's Coffee  East     (212) 123-4567
2     Jane's Donuts  East     (201) 777-1717     (updated row)
3     Sandy's Candy  Central  (115) 231-1233     (updated row)
4     Jane's Donuts  West     (650) 222-1212     (new row)


Identifying history preserving transforms

Data Integrator supports history preservation with three built-in transforms:
• Table_Comparison transform
• History_Preserving transform
• Key_Generation transform

This combination of transforms:
• Compares data to see what has changed
• Changes the operation code of an update to an insert to preserve history
• Generates new IDs for key columns

An example is provided below:

Table_Comparison transform

The Table_Comparison transform allows you to detect and forward changes that have occurred since the last time a target was updated. This transform compares two data sets and produces the difference between them as a data set with rows flagged as INSERT or UPDATE. It also allows you to identify changes to a target table for incremental updates, as in the example below:

Input:

ID  Name     Address             Op Code
10  Joe      11 Rehab Street     Normal
20  Sid      13 Cadillac Drive   Normal
30  Charlie  15 Yukon Lane       Normal
40  Dolly    Goldrush Saloon     Normal
50  Elvis    Graceland, Memphis  Normal

Comparison table:

ID  Name     Address             Op Code
10  Joe      12 Halfway House    Normal
20  Sid      13 Deadhead Road    Normal
30  Charlie  15 Yukon Lane       Normal
40  Dolly    Goldrush Saloon     Normal

Output:

ID  Name     Address             Op Code
10  Joe      11 Rehab Street     Update
20  Sid      13 Cadillac Drive   Update
50  Elvis    Graceland, Memphis  Insert


Data Input

The input data set must be flagged as NORMAL. If the input data set contains hierarchical data, only the top-level data is included in the comparison, and nested schemas are not passed through to the output.
Note: Use caution when using columns of data type real in this transform since comparison results are unpredictable for this data type.
Databases store real values as a 32-bit approximation of the number. Because of this approximation, comparison results are unpredictable when a real value is used in an equality or inequality comparison. Therefore, it is recommended that you do not use a real value in a WHERE clause. Real values appear in WHERE clauses that Data Integrator generates when a column of type real is used, for example, as a compare column in the Table_Comparison transform.

Options

The Table_Comparison transform offers the following options (see the example after this list):
• Table name: the fully qualified name of the source table from which the maximum existing key is determined (key source table). This table must already be imported into the repository. Table name is represented as DATASTORE.OWNER.TABLE, where DATASTORE is the name of the datastore Data Integrator uses to access the key source table and OWNER depends on the database type associated with the table.
• Generated key column: (Optional) A column in the comparison table. When there is more than one row in the comparison table with a given primary key value, this transform compares the row with the largest generated key value of these rows and ignores the other rows.
• Comparison method: allows you to select the method for accessing the comparison table. You can select from Row-by-row select, Cached comparison table, and Sorted input.
• Input primary key column(s): the column(s) in the input data set which uniquely identify each row. These columns must be present in the comparison table with the same column names and data types.
• Compare columns: (Optional) Improves performance by comparing only the sub-set of columns you drag into this box from the input schema. If no columns are listed, all columns in the input data set that are also in the comparison table are used as compare columns.
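For instance, the activity at the end of this lesson configures the Table_Comparison transform roughly as follows (the values are specific to that exercise, and the fully qualified table name shown is an assumption based on its Target_DS datastore and database owner):

Table name: Target_DS.DBO.EMPLOYEE_DIM
Generated key column: SKEY
Input primary key columns: EMP_ID
Compare columns: EMP_ID, LNAME, FNAME, REGION, TIMESTAMP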

Data Output

A data set containing rows flagged as INSERT or UPDATE. This data set contains only the rows that make up the difference between the two input sources. The schema of the output data set is the same as the schema of the comparison table. Note that no DELETE operations are produced.

The transform compares two data sets, one from the input to the transform (input data set), and one from a database table specified in the transform (the comparison table). The transform selects rows from the comparison table based on the primary key values from the input data set. The transform compares columns that exist in the schemas for both inputs.

If a column has a date data type in one table and a datetime data type in the other, the transform compares only the date section of the data. The columns can also be time and datetime data types, in which case Data Integrator only compares the time section of the data.


For each row in the input data set, there are three possible outcomes from the transform:
• An INSERT row is added: the primary key value from the input data set does not match a value in the comparison table. The transform produces an INSERT row with the values from the input data set row. If there are columns in the comparison table that are not present in the input data set, the transform adds these columns to the output schema and fills them with NULL values.
• An UPDATE row is added: the primary key value from the input data set matches a value in the comparison table, and values in the non-key compare columns differ in the corresponding rows from the input data set and the comparison table. The transform produces an UPDATE row with the values from the input data set row. If there are columns in the comparison table that are not present in the input data set, the transform adds these columns to the output schema and fills them with values from the comparison table.
• The row is ignored: the primary key value from the input data set matches a value in the comparison table, but the comparison does not indicate any changes to the row values.

History_Preserving transform

The History_Preserving transform allows you to produce a new row in your target rather than updating an existing row. You can indicate in which columns the transform identifies changes to be preserved.
If the values of certain columns change, this transform creates a new row for each row flagged as UPDATE in the input data set.

Data Input

A data set that is the result of a comparison between two images of the same data in which changed data from the newer image are flagged as UPDATE rows and new data from the newer image are flagged as INSERT rows.

For example, a target table that contains employee address information is updated periodically from a source table. In this case, the table comparison flags changed data for the employees Joe and Sid. The result is a single row flagged with the INSERT operation code as shown below:

Comparison table:

ID  Name     Address             Op Code
10  Joe      12 Halfway House    Normal
20  Sid      13 Deadhead Road    Normal
30  Charlie  15 Yukon Lane       Normal
40  Dolly    Goldrush Saloon     Normal

Output:

ID  Name     Address             Op Code
10  Joe      11 Rehab Street     Insert
20  Sid      13 Cadillac Drive   Insert
30  Charlie  15 Yukon Lane       Update
50  Elvis    Graceland, Memphis  Insert


The input data set can contain hierarchical data. The transform operates only on the rows at the top level of the input data set, and passes nested data through to the output without change. Columns containing nested schemas cannot be used as transform parameters.
Note: Use caution when using columns of data type real in this transform since comparison results are unpredictable for this data type.
Databases store real values as a 32-bit approximation of the number. Because of this approximation, comparison results are unpredictable when a real value is used in an equality or inequality comparison. Therefore, it is recommended that you do not use a real value in a WHERE clause. Real values appear in WHERE clauses that Data Integrator generates when a column of type real is used.

Options

The History_Preserving transform offers these options:
• Date columns: valid from: a date or datetime column from the source schema. Specify a Valid from date column if the warehouse uses an effective date to track changes in data.
• Date columns: valid to: a date or datetime column from the source schema. Specify a Valid to date column if the warehouse uses an effective date to track changes in data and if you specified a Valid from date column.
• Date columns: valid to date value: a date value specified as a four-digit year followed by a period, followed by a two-digit month, followed by a period, and followed by a two-digit day value.
• Current flag: column: a column from the source schema that identifies the current valid row from a set of rows with the same primary key. The flag column indicates whether a row is the most current data in the warehouse for a given primary key.
• Current flag: set value: an expression that evaluates to a value with the same data type as the current flag column. This value is used to update the current flag column in the new row in the warehouse added to preserve history of an existing row.
• Current flag: reset value: an expression that evaluates to a value with the same data type as the current flag column. This value is used to update the current flag column in an existing row in the warehouse that included changes in one or more of the compare columns.
• Compare columns: the column or columns in the input data set for which this transform compares the before and after images to determine if there are changes. If the values in each image of the data match, the transform flags the row as UPDATE. The result updates the warehouse row with values from the new row. The row from the before image is included in the output as UPDATE to effectively update the date and flag information. If the values in each image do not match, the row from the after image is included in the output of the transform flagged as INSERT. The result adds a new row to the warehouse with the values from the new row.
• Input contains duplicate keys: provides support for input rows with duplicate primary key values. Each data set containing duplicate primary keys is loaded correctly.


Data Output

A data set with rows flagged as INSERT or UPDATE.

Key_Generation transform

The Key_Generation transform generates new keys for new rows in a data set. When it is necessary to generate artificial keys in a table, this transform looks up the maximum existing key value from a table and uses it as the starting value to generate new keys. The transform expects the generated key column to be part of the input schema.

Use the Key_Generation transform to produce keys that distinguish rows that would otherwise have the same primary key.

For example, suppose the History_Preserving transform produces rows to add to a warehouse. These rows also have the same primary key as rows that already exist in the warehouse. In this case, you can add a generated key to the warehouse table to distinguish these two rows that have the same primary key.

Data Input

A data set that is the result of a comparison between two images of the same data in which changed data from the newer image are flagged as UPDATE rows and new data from the newer image are flagged as INSERT rows.

The data set includes a column into which generated keys are added. The input data set can contain hierarchical data. The transform operates only on the rows at the top-level of the input data set, and passes nested data through to the output without change. Columns containing nested schemas cannot be used as transform parameters.

Options

The Key_Generation transform offers these options (see the example after this list):
• Table name: the fully qualified name of the source table from which the maximum existing key is determined (key source table). This table must be already imported into the repository. Table name is represented as DATASTORE.OWNER.TABLE, where DATASTORE is the name of the datastore Data Integrator uses to access the key source table and OWNER depends on the database type associated with the table.
• Generated key column: the column in the key source table containing the existing key values. A column with the same name must exist in the input data set; the new keys are inserted in this column.
• Increment value: the interval between generated key values.
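In the activity at the end of this lesson, for example, the Key_Generation transform is set up roughly like this (the fully qualified table name shown is an assumption based on that exercise's Target_DS datastore and database owner):

Table name: Target_DS.DBO.EMPLOYEE_DIM
Generated key column: SKEY
Increment value: 1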

Data Output

The input data set with the addition of key values in the generated key column for input rows flagged as INSERT.


Activity: Preserving history for changes in data with slowly changing dimensions

Objective
In this activity you will:
• Use the Table_Comparison, History_Preserving and Key_Generation transforms to capture data from a slowly changing dimension while preserving data history.
To complete this activity successfully, you will need to:
• Add a new column (SKEY) to the slowly changing dimension target EMPLOYEE_DIM.

Generally, employee information such as address and phone number is very susceptible to change. In an SCD table, you preserve history by keeping the old row and adding new rows for the updated data. As a result, you may have duplicate employee ID records, thus invalidating the EMP_ID primary key.

To resolve this you use a combined primary key consisting of EMP_ID and a surrogate key to add rows for the same EMP_ID and also keep the original EMP_ID number. When updating data, the Key_Generation transform looks at the highest skey value and assigns the next increment value to the next added row.

Adding the skey
1 In MS SQL Enterprise Manager, browse to the Target database.
2 Double-click Tables.
3 Right-click EMPLOYEE_DIM, and select Design Table.
4 Right-click LName, and select Insert to insert a new column.
5 Add a new column:
• Column Name: SKEY
• Data Type: int
• Length: 4
• Do not allow Nulls
6 Select Emp_ID, and while pressing the Shift key on your keyboard, select SKEY. You should now have both Emp_ID and Skey highlighted.
7 Right-click the selected fields, and select Set Primary Key.



Your EMPLOYEE_DIM table should look like this:

8 Save the changes.
9 In the Designer, in the Datastores tab, navigate to the EMPLOYEE_DIM table.
10 Right-click the EMPLOYEE_DIM table, and select Reimport.
A message displays indicating that reimporting overwrites changes in this table and may invalidate its use.
11 Click Yes to accept re-importing.
12 View data in EMPLOYEE_DIM and you should see that SKEY has been added.

Using the Table_Comparison, History_Preserving and Key_Generation transforms
1 In the Designer project area, open CDC_DF.
2 Delete the existing Query transform and EMPLOYEE_DIM from CDC_DF.
3 Add a new Query transform and EMPLOYEE_DIM target to the data flow.
4 Connect ODS_EMPLOYEE, the Query transform, and EMPLOYEE_DIM.
Notice that SKEY has now been added to the Query Schema Out pane in the Query Editor.
5 Remap the source columns to the appropriate target columns:
Source column    Target column
EMP_ID           EMP_ID
LNAME            LNAME
FNAME            FNAME
REGION           REGION
LAST_UPDATE      TIMESTAMP
6 In the Query Schema Out pane, right-click SKEY, and select Delete.
Note: You want to delete SKEY from the Query transform because you want to use the Key_Generation transform to supply the SKEY values for inserted rows instead.
7 In CDC_DF, delete the connection between the Query transform and the target EMPLOYEE_DIM table.
8 Add these transforms in between the Query transform and the target table:


• Table_Comparison: this transform compares the source and target tables. It determines which rows have been updated in the source table, and also which rows are new inserts into the source table.

• History_Preserving transform: this transform turns all update statements into an insert, and leaves insert statements as inserts.

• Key_Generation transform: this transform creates the surrogate key on the target table, based on the maximum existing surrogate key value.

9 Connect the source, transforms and targets.

10 Double-click the Table_Comparison transform to open it.
11 In the Table_Comparison Editor Table Name field, select EMPLOYEE_DIM.
12 In the Generated key column field, select Skey.
13 From the Schema In column, drag Emp_ID into the Input primary key columns area.
14 From the Schema In column, drag all the columns into the Compare columns area.

You will use this range of columns to compare between the source and target tables.

15 You do not need to configure anything in the History_Preserving transform because it automatically turns all updates into INSERT statements to preserve history for the existing rows.

16 Double-click the Key_Generation transform.
17 In the Key_Generation Editor Table Name field, select EMPLOYEE_DIM.
18 In the Generated key column field, select SKEY.
19 In the Increment Value field, keep the default value of 1.
20 Validate, save and execute the job.


21 Click the statistics icon to display the job statistics.
Note that the History_Preserving and Key_Generation transforms process 0 rows from the source to the target since there have been no updates to the ODS_Employee table:

22 View data for EMPLOYEE_DIM. You should still have 4 rows.
Note the values in the SKEY column.

Making and capturing changes in your data
1 In SQL Server Enterprise Manager, from the ODS source, right-click ODS_EMPLOYEE, point to Open Table, and select Return all rows.
2 Update the Region field for Jane Jones to 5 in the existing employee record, and in the Designer, execute the job again.

3 Click the statistics icon to display the job statistics.
Note that the History_Preserving and Key_Generation transforms process one row from the source to the target because you updated one record in the ODS_Employee table:

4 View data for EMPLOYEE_DIM.

Note that EMP_ID = 3 now has two rows, one with the original SKEY value of 3, and one with a value of 5. This SKEY value is assigned the next row increment in the table and indicates the most recent change in EMP_ID = 3.


5 Add this new record to ODS_EMPLOYEE and save it:
• EMP_ID: 5
• LNAME: JAMES
• FNAME: MARK
• REGION: 9
• ADDRESS: CELL 5
• CITY: SAN FRANCISCO
• STATE: CA
• ZIP: 94050
• COUNTRY: USA
• LAST_UPDATE: Use today's date.

6 In the Designer, run the job again.
7 Click the statistics icon to display the job statistics.
You should see a row count of 1 displayed for the Key_Generation and History_Preserving transforms.
8 View data for EMPLOYEE_DIM.
You should have 6 rows. Note that now you have a new EMP_ID = 5 for Mark James with an SKEY value of 6. Again, this SKEY value takes on the next row increment value to indicate a change has occurred in the source table:

9 In ODS_EMPLOYEE, change the first name for EMP_ID = 5 from Mark to Marcus, and execute the job again.

10 Click the statistics icon to display the job statistics.
You should see a row count of 1 displayed for the Key_Generation and History_Preserving transforms.
11 View data for EMPLOYEE_DIM.
You should see that you have a new row for EMP_ID = 5 with an SKEY value of 7:

A very important test for CDC_Job is to make sure that your target is not updated when there are no changes in the source ODS_EMPLOYEE table.

12 Execute the job again without making any changes to ODS_EMPLOYEE.
You should see that no records are added to EMPLOYEE_DIM. The job statistics tab should also display a row count of 0.


Lesson Summary

Review

Quiz: Capturing Changes in Data
1 What are the two most important reasons for using CDC?

2 What method of CDC is preferred for the performance gain of extracting the least rows?

3 What is an initial load?

4 What is a delta load?

5 What type of slowly changing dimensions is this combination of transforms used for? Table Comparison, Key Generation, and History Preserving.



Summary

After completing this lesson, you are now able to:
• Explain what Changed Data Capture (CDC) is
• Use Changed Data Capture (CDC) with time-stamped sources
• Create an initial-load job
• Create a delta-load job
• Explain what history preservation is
• Identify history preserving transforms



Lesson 10
Handling Errors and Auditing

Data Integrator allows you to recover from job exceptions with minimal downtime. Recovery mechanisms are available for batch jobs only.

In this lesson you will learn about:
• Understanding recovery mechanisms
• Understanding the use of auditing in data flows

Duration: 2 hours


Understanding recovery mechanisms

Introduction

If a Data Integrator job does not complete properly, you must fix the problems that prevented the successful execution of the job.

After completing this unit, you will be able to:
• List levels of data recovery strategies
• Recover a failed job using automatic recovery and marking recovery units for executing jobs
• Create recoverable work flows
• Use Try/Catch blocks to specify alternate work flow options in the event of job errors
• Process data with problems

During the failed job execution, some data flows in the job may have completed and some tables may have been loaded, partially loaded, or altered. Therefore, you need to design your data movement jobs so that you can recover, that is, rerun the job and retrieve all the data without duplicate or missing data. It is important to note that some recovery mechanisms are for use in production systems and are not supported in development environments.

Listing levels of data recovery strategies

There are different levels of data recovery and recovery strategies. You can:
• Recover your entire database: use your standard DBMS services to restore a crashed data cache to an entire database.
• Recover a partially loaded job: use automatic recovery or manually create recoverable work flows using status tables.
• Recover from partially loaded tables: use the auto-correct load feature, the Table_Comparison transform, do a full replacement of the target, or include a preload SQL command to avoid duplicate loading of rows when recovering from partially loaded tables.
• Check data anomalies: use the Validation transform to check data anomalies discovered during data profiling to ensure Data Integrator only transfers valid data to the target system and that it manages incorrect data appropriately. The Validation transform captures rules for valid data and tests data flowing through the ETL process for these conditions. It also provides built-in exception handling facilities to handle data that failed your validation criteria.

Depending on the relationships between data flows in your application, you may use a combination of these techniques to recover from exceptions.



Partially loaded jobs and automatic recovery

This Data Integrator feature allows you to run unsuccessful jobs in recovery mode.

Automatic recovery allows a failed job to be rerun starting from the point of failure. Automatic recovery only works if the job is unchanged, so it is only useful in production (not in development or test) environments.

When you use the automatic recovery feature, you must enable the feature during initial execution of a job. Data Integrator saves the results from successfully completed steps when the automatic recovery feature is enabled.

To enable automatic recovery in a job from the Designer
1 From the Local object library, drag a project into the Designer project area.
2 Right-click the job name, and select Execute.
3 On the Parameters tab in the Execution Properties window, select the Enable recovery check box.
If this check box is not selected, Data Integrator does not record the results from the steps during the job and cannot recover the job if it fails.

4 Click OK.

Marking recovery units

In some cases, steps in a work flow depend on each other and must be executed together. Because of the dependency, you should also designate the work flow as a recovery unit. When a work flow is a recovery unit, the entire work flow must complete successfully. If the work flow does not complete successfully, Data Integrator executes the entire work flow during recovery, including the steps that executed successfully in prior work flow runs.

To specify a work flow as a recovery unit
1 From the Designer Local object library, drag a project into the project area.
2 Right-click the work flow, and select Properties.
3 On the Properties window, select the Recover as a unit check box.

Note: Data flows and work flows that have been set to Execute only once do not re-execute after the first run of the job is completed. Therefore, it is recommended that you do not mark a work flow or data flow as Execute only once when the work flow or a parent work flow is a recovery unit.


Recovery mode

If a job with automated recovery enabled fails during execution, you can execute the job again in recovery mode. During recovery mode, Data Integrator retrieves the results for successfully completed steps and reruns uncompleted or failed steps under the same conditions as the original job. For example, suppose a daily update job running overnight successfully loads dimension tables in a warehouse. However, while the job is running, the database log overflows and stops the job from loading fact tables. The next day you truncate the log file and run the job again in recovery mode.

In recovery mode, Data Integrator executes the steps or recovery units that did not complete successfully in a previous execution. This includes steps that failed and steps that generated an exception but completed successfully, such as those in a try/catch block. As in normal job execution, Data Integrator executes the steps in parallel if they are not connected in the work flow diagrams and in serial if they are connected.
Note: The recovery job does not reload the dimension tables in a failed job because the original job, even though it failed, successfully loaded the dimension tables.

To ensure that the fact tables are loaded with the data that corresponds properly to the data already loaded in the dimension tables, make sure:
• Your recovery job uses the same extraction criteria that your original job used when loading the dimension tables. If your recovery job uses new extraction criteria, such as basing data extraction on the current system date, the data in the fact tables will not correspond to the data previously extracted into the dimension tables. If your recovery job uses new values, then the job execution may follow a completely different path through conditional steps or try/catch blocks.
• Your recovery job follows the exact execution path that the original job followed. Data Integrator records any external inputs to the original job so that your recovery job can use these stored values and follow the same execution path.

When recovery is enabled, Data Integrator stores results from the following types of steps:
• Work flows
• Batch data flows
• Script statements
• Custom functions (stateless type only)
• SQL function
• exec function
• get_env function
• rand function
• sysdate function
• systime function


To enable recovery mode in a job from the Designer
1 In the project area, select the job name for the job that failed.
2 Right-click the job, and select Execute.
3 In the Execution Properties window, on the Parameters tab, select the Recover from last execution check box.
Note: This option is not available when a job has not yet been executed, the previous job run succeeded, or when recovery mode was disabled during the previous run.

To specify that a job executes the work flow one time
You can specify that a work flow should only execute once. A job never re-executes that work flow, even if the work flow is contained in a work flow that is a recovery unit that re-executes.
Note: Business Objects recommends that you do not mark a work flow as “Execute only once” if the work flow or a parent work flow is a recovery unit.
1 Right-click the work flow and select Properties.
The Properties window opens for the work flow.
2 Select the Execute only once check box.
3 Click OK.


To specify that a batch job executes the data flow one time
When you specify that a data flow should only execute once, a batch job will never re-execute that data flow after the data flow completes successfully. This also applies to a data flow that is contained in a work flow that is a recovery unit that re-executes.
Tip: You want to set a data flow or a work flow to execute only once if you are recovering from a disaster, for example, if you cannot redo the work flow or data flows within it.
Note: Business Objects recommends that you do not mark a data flow as Execute only once if a parent work flow is a recovery unit.
1 Right-click the data flow and select Properties.
The Properties window opens for the data flow.

2 Select the Execute only once check box.

3 Click OK.

Partially loaded data

Executing a failed job again may result in duplication of rows that were loaded successfully during the first job run.

When recovering a job, you do not want to insert duplicate rows when the data flow reexecutes. Within your recoverable work flow you can use several methods to ensure that you do not insert duplicate rows.

You can:
• Include the Table_Comparison transform in your data flow when you have tables with more rows and fewer fields, such as fact tables.
• Design the data flow to completely replace the target table during each execution. This technique can be optimal when the changes to the target table are numerous compared to the size of the table.


• Use auto-correct load for the target table when you have tables with fewer rows and more fields, such as dimension tables. The auto-correct load checks the target table for existing rows before adding new rows to the table. Using the auto-correct load option, however, can slow jobs executed in non-recovery mode. Consider this technique when the target table is large and the changes to the table are relatively few.
Note: The auto-correct load option generates updates even for unchanged rows. For large amounts of data, use a Table_Comparison transform to avoid this.
You can enable the auto-correct load option by selecting it from the Target Table Editor.
• Include a SQL command to execute before the table loads. Preload SQL commands can remove partial database updates that occur during incomplete execution of a step in a job. Typically, the preload SQL command deletes rows based on a variable that is set before the partial insertion step began (see the sketch after this list). For more information on preloading SQL commands, see “Using preload SQL to allow re-executable data flows”, Chapter 18 in the Data Integrator Designer Guide.
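A preload SQL command of that kind might be sketched as follows, where TARGET_TABLE, LOAD_DATE, and $start_timestamp are illustrative names for the target table, a load-audit column, and a variable set at the start of the job; the exact variable-substitution syntax depends on how you embed the variable, so check the Designer Guide chapter referenced above:

DELETE FROM TARGET_TABLE WHERE LOAD_DATE >= {$start_timestamp}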

Creating recoverable work flows

You can do a manual recovery of a job by creating recoverable work flows using status tables. This technique allows you to rerun jobs without regard to partial results in a previous run.

Data Integrator executes the entire work flow during recovery, including steps that executed successfully in prior work flow runs.

Recoverable work flows must have certain characteristics:
• The recoverable work flows can be run repeatedly.
• The job implements special steps to recover data when a step did not complete successfully during a previous run.
• The recoverable work flow includes a recovery strategy that executes only in recovery mode.

Recoverable work flows also consist of three objects:
• A script to determine if recovery is required.
This script reads the ending time in the status table that corresponds to the most recent start time. This indicates that the previous work flow may not have completed properly if there is no ending time for the most recent starting time.
• A conditional that calls the appropriate data flow.
• A script to update a status table, signifying successful execution.



This script executes after the work flows in the conditional have completed. Its purpose is to update the status table with the current timestamp to indicate successful execution.

To create a manual, recoverable work flow
For the procedures below, make sure you give the job, work flow and data flow distinctive names that will help you identify them. For example, you may want to name your job Recovery_Job.
1 Create a new job.
2 Add the variables below to your job:

Variable name      Type
$Recovery_Needed   int
$End_Time          varchar (length)

The variable $Recovery_Needed determines whether or not to run a data flow in recovery mode. The $End_Time variable is used to determine the value of $Recovery_Needed.

3 Add a work flow to your job.
4 Add a script to your work flow to get the ending time in the status table that corresponds to the most recent start time. Rename your script to GetStatus.
5 Double-click the script to define the script in your work flow. The script content depends on the RDBMS on which status_table resides, as shown below.
$end_time = sql('Target_DS', 'select convert(char(), end_time, 0) from status_table where start_time = (select max(start_time) from status_table)');

Work flow: Recoverable work flow

GetStatus (script) > Conditional > SetStatus (script)


if (($end_time = NULL) or ($end_time = ''))
$recovery_needed = value;
else $recovery_needed = value;

Note: Make sure the entire statement is contained on a single line.6 Add another script to your work flow that updates the status table with the

current timestamp to indicate successful execution. Rename your script to SetStatus.

7 Double-click the script to define the script in your work flow. The script content depends on the RDBMS on which status_table resides, as shown below.
sql('Target_DS', 'update status_table set end_time = getdate() where start_time = (select max(start_time) from status_table)');
Note: These scripts are for use with SQL Server only. Make sure the entire statement is contained on a single line.
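The two scripts above use SQL Server syntax (convert() and getdate()). If status_table resides on a different RDBMS, only the SQL inside the sql() calls changes. As an illustrative sketch that is not taken from this guide, an Oracle equivalent of the two scripts might look like this, assuming the same table and column names:

$end_time = sql('Target_DS', 'select to_char(end_time) from status_table where start_time = (select max(start_time) from status_table)');

sql('Target_DS', 'update status_table set end_time = sysdate where start_time = (select max(start_time) from status_table)');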

8 From the Tool Palette, click the Conditional icon, drag it onto the workspace, and rename your conditional.

9 Double-click the Conditional to open it.
10 In the IF field of the Conditional Editor window, enter the expression that evaluates the variable $recovery_needed to true or false. In this case, enter ($Recovery_Needed = 1).
11 From the Local object library, drag the recovery-mode data flow into the Then box of your Conditional editor. The Then box denotes the true branch of the conditional you are setting.
12 From the Local object library, drag the normal data flow into the Else box of your Conditional editor. The Else box denotes the false branch of the conditional you are setting.

13 Connect the two scripts and the conditional together.

14 Validate the work flow, conditional, and scripts.
15 Execute the job.

The first time this job is executed, the recovery mode data flow is called. This is due to the end_time field being NULL in status_table.

16 Check the trace log to see which data flow was called.
17 Use a query tool to check the contents of your target table.
18 Check the contents of status_table and notice that the end_time field now contains a value.
19 Run the job again.

This time the non-recovery mode data flow will run. Again, you can check this by examining the trace log messages.


Activity: Creating a recoverable work flow

Objectives
In this activity you will:
• Create a recoverable job that loads the sales organization dimension table.
• Create a work flow that uses Data Integrator conditionals and uses the auto-correct load option in the target table loader.

This recoverable job consists of three objects:
• A script to determine if recovery is required
• A conditional that calls the appropriate data flow
• A script to update a status table with the current timestamp

Instructions

To create the script to determine if recovery is required
1 Create a new job named Recovery_Job.
2 Declare these global variables within the job:
• $recovery_needed: int
• $end_time: varchar(20)
• $max_start_time: varchar(20) (for the Informix environment only)
The $recovery_needed variable is used to determine whether or not to run a data flow in recovery mode. $end_time is used to determine the value of $recovery_needed.

3 Add a script in Recovery_Job. You do not need to create a work flow for this activity.

4 Rename the script to GetWFStatus.
5 Enter the script contents:
$end_time = sql('Target_DS', 'select convert(char(20), end_time, 0) from status_table where start_time = (select max(start_time) from status_table)');
if (($end_time = NULL) or ($end_time = ''))
    $recovery_needed = 1;
else
    $recovery_needed = 0;

Tip: Make sure the entire sql() statement is contained on a single line and not split across multiple lines. You can copy and paste the script from the Resource CD: browse to the Activity_Source folder and open Lesson_10_scripts.txt.

Tip: Validate the script.



To create a conditional that calls the appropriate data flow
1 Create the recovery mode data flow by replicating the SalesOrg_DF data flow.
Note: The default name for the replicated data flow is Copy_1_SalesOrg_DF.
2 Rename the data flow you just created to SalesOrg_ACDF, and enable auto-correct load for the target table.
Tip: Auto-correct load is an option for the target table.
Note: The recovery branch of the conditional calls this modified version of SalesOrg_DF. The only difference is that the table loader for the salesorg_dim table has the auto-correct load option set. The auto-correct load option causes the table loader to check for existing rows with the same key as each INSERT/UPDATE. When there is a match, the pre-existing row is deleted before the new row is inserted (UPDATEs are mapped to INSERTs).
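Conceptually, the behavior described in the note above corresponds roughly to the following SQL for each incoming row. This is only a sketch to illustrate the effect: the key column name salesorg_key is hypothetical, and Data Integrator generates the actual load statements itself.

-- effect of auto-correct load for one incoming row (sketch)
delete from salesorg_dim where salesorg_key = <incoming key value>;
insert into salesorg_dim (salesorg_key, ...) values (<incoming row values>);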

3 Add a conditional to Recovery_Job, and rename this conditional to Recovery_needed.

4 Double-click the Recovery_needed conditional to enter the IF expression that evaluates to true or false. In this instance, enter ($recovery_needed = 1).

5 Select SalesOrg_ACDF, and drag it into the upper (Then) partition of the conditional. This data flow is for the true branch of the conditional, and specifies the recoverable work flow for the conditional.

6 Open the data flow tab in the object library, and drag SalesOrg_DF into the lower (Else) partition of the conditional. This data flow is for the false branch of the conditional and specifies the normal work flow for this conditional.

To create a script to update a status table
1 Open the job named Recovery_Job.
2 Create a script to the right of the Recovery_needed conditional and name it UpdateWFStatus.
3 Open the script named UpdateWFStatus.
4 Enter the script contents:
sql('Target_DS', 'update status_table set end_time = getdate() where start_time = (select max(start_time) from status_table)');

Tip: You can copy and paste the script from the Resource CD: browse to the Activity_Source folder and open Lesson_10_scripts.txt.

5 Validate the script.


To specify the job execution order for the recoverable work flow
1 Connect the objects in the job named Recovery_Job so that they execute in this order:
GetWFStatus > Recovery_needed > UpdateWFStatus
2 Before executing this job, use a query tool and delete the existing data from the table named SALESORG_DIM using this script:
Delete SALESORG_DIM
3 Check the contents of status_table and notice that the end_time field is NULL.
4 Execute Recovery_Job.
5 Check the trace log to see which data flow was invoked.
6 Use a query tool to check the contents of the SALESORG_DIM table. There should be three rows.
7 Check the contents of status_table and notice that the end_time field now contains a value.
8 Execute the job again.

This time the non-recovery mode data flow will run. Again, you can check this by examining the trace log messages.
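If you want to check the tables from a query tool in steps 3, 6, and 7, queries along these lines are sufficient (a sketch using the table names from this activity):

select start_time, end_time from status_table;
select count(*) from SALESORG_DIM;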

Using try/catch blocks to specify alternate work flow options

Try/catch blocks are single-use objects that can be used in a job or work flow.

A try is part of a serial sequence called a try/catch block. The try/catch block allows you to specify alternative work flows if errors occur during job execution. Try/catch blocks catch classes of errors, apply solutions that you provide, and continue execution.

For each catch in the try/catch block, specify:
• One exception or group of exceptions handled by the catch. To handle more than one exception or group of exceptions, add more catches to the try/catch block.

• The work flow to execute if the indicated exception occurs. Use an existing work flow or define a work flow in the catch editor.

If an exception is thrown during the execution of a try/catch block, and if no catch is looking for that exception, then the exception is handled by normal error logic.
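As an illustrative sketch only, the work flow you attach to a catch can be as simple as a script that records that the alternative path ran. The job_error_log table here is hypothetical, and getdate() assumes the target datastore is SQL Server:

print('Catch executed: running the alternative work flow');
sql('Target_DS', 'insert into job_error_log (error_time) values (getdate())');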



Try/catch blocks and automatic recovery
Data Integrator does not save the result of a try/catch block for reuse during recovery. If an exception is thrown inside a try/catch block, then during recovery Data Integrator executes the step that threw the exception and subsequent steps.

Because the execution path through the try/catch block might be different in the recovered job, using variables set in the try/catch block could alter the results during automatic recovery.

For example, suppose you create a job that defines a variable, $i, that you set within a try/catch block. If an exception occurs, you set an alternate value for $i. Subsequent steps are based on the value of $i.

During the first job execution, the first work flow contains an error that generates an exception, which is caught. However, the job fails in the subsequent work flow.

You fix the error and run the job in recovery mode. During the recovery execution, the first work flow no longer generates the exception. Thus the value of the variable, $i, is different, and the job selects a different subsequent work flow, producing different results.

[Figure: in the first job execution, the exception is thrown and caught, the catch sets $i = 0, IF $i < 1 evaluates to TRUE, and an error then occurs in the subsequent work flow. In the recovery execution, no exception is thrown, $i remains 10, IF $i < 1 evaluates to FALSE, and the execution path changes because of the results from the try/catch block.]

To ensure proper results with automatic recovery when a job contains a try/catch block, do not use values set inside the try/catch block or reference output variables from a try/catch block in any subsequent steps.

Processing data with problems

Jobs might not produce the results you expect because of problems with data. In some cases, Data Integrator is unable to insert a row. In other cases, Data Integrator might insert rows with missing information.

You can design your data flows to anticipate and process these types of problems. For example, you might have a data flow write rows with missing information to a special file that you can inspect later.

This section introduces you to some mechanisms for processing data with problems.

Overflow files
A row that cannot be inserted is a common data problem. There are many reasons for a load to fail, including duplicate key values, overflow in column settings, and running out of memory for the target. Overflow files help you process this type of data problem.

When you specify an overflow file and Data Integrator cannot load a row into a table, Data Integrator writes the row to the overflow file instead. The trace log indicates the data flow in which the load failed and the location of the file.

You can use the overflow information to identify invalid data in your source or problems introduced in the data movement. Every new run will overwrite the existing overflow file.




To use an overflow file in a job
1 Double-click the target table in your data flow.
2 Click the Options tab of the Target Table Editor.
3 Under the Error handling section, select the Use overflow file check box.
4 In the Overflow file name: field, type a filename.
Note: When you specify an overflow file, give a full path name to ensure that Data Integrator creates a unique file when more than one file is created in the same job.
5 In the Overflow file format: list, select what you want Data Integrator to write to the file about the rows that failed to load.
• If you select Write data, you can use Data Integrator to read the data from the overflow file, cleanse it, and load it into the target table.
• If you select Write sql, you can use the commands to load the target manually when the target is accessible.
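For example, with Write sql selected, each row that fails to load is written out as a SQL statement that you can run against the target after the problem is fixed. The exact format depends on the target table; as an illustrative sketch with hypothetical column names, an entry might look like:

insert into CUSTOMER_DIM (CUST_ID, NAME, PHONE) values (10002, 'Tanaka', '(415)366-1864');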

Missing values
A missing or invalid value in the source data is another common data problem. Using queries in data flows, you can identify missing or invalid values in source data. You can also choose to include this data in the target or to disregard it.

For example, suppose you are extracting data from a source and you know that some phone numbers and customer names are missing. You can use a data flow to extract data from the source, load the data into a target, and filter the NULL values into a file for your inspection.


This data flow has five steps, as illustrated:

1 Extracts data from the source.
2 Selects the data set to load into the target and applies new keys. It does this by using the Key_Generation function.
3 Loads the data set into the target.
4 Uses the same data set for which new keys were generated in step 2, and selects rows with missing customer names and phone numbers.
5 Writes the customer IDs for the rows with missing data to a file.

[Figure: the data flow places a Query with the mapping Key_generation('target_ds.owner.Customer', 'Customer_Gen_Key', 1) between the Customer source and the target (steps 1 to 3). A second query selects the rows with missing data using SELECT Query.CustomerID, Query.NAME, Query.PHONE FROM Query WHERE (NAME = NULL) OR (PHONE = NULL), and writes them to the Missing Customer file (steps 4 and 5); sample output rows include 10002,,(415)366-1864 and 20030,Tanaka,.]


Understanding auditing in data flows

Introduction

The Auditing feature provides a way to ensure that a data flow loads correct data into the warehouse.

You can collect audit statistics on the data that flows out of any Data Integrator object, such as a source, transform, or target. If a transform has multiple distinct or different outputs (such as Validation or Case), you can audit each output independently.

After completing this unit, you will be able to:
• Define audit points and rules
• Define audit actions on failures
• View audit results
• Resolve invalid audit labels
• Explain guidelines for choosing audit points

When you audit data flows, you:
• Define audit points to collect run time statistics about the data that flows out of objects. These audit statistics are stored in the Data Integrator repository.

• Define rules with these audit statistics to ensure that the data extracted from sources, processed by transforms and loaded into targets is what you expect.

• Generate a run time notification that includes the audit rule that failed and the values of the audit statistics at the time of failure.

• Display the audit statistics after the job execution to help identify the object in the data flow that might have produced incorrect data.

• The auditing feature is accessible when you right-click a data flow object in the object library.

Defining audit points and rules

The Label tab displays the sources and targets in the data flow.


Note: If your data flow contains multiple consecutive query transforms, the Audit Editor shows the first query.

You can display different views in the Audit Editor using these icons:

Tool tip                               Description
Collapse All                           Collapses the expansion of the source, transform, and target objects.
Show All Objects                       Displays all the objects within the data flow.
Show Source, Target and first Query    Default display, which shows the source, target, and first query objects in the data flow. If the data flow contains multiple consecutive query transforms, only the first query displays.
Show Labelled Objects                  Displays the objects that have audit labels defined.

Audit points
An audit point represents the object in a data flow where you collect statistics. You can audit a source, a transform, or a target in a data flow.

When you define audit points on objects in a data flow, you specify an audit function.

An audit function represents the audit statistic that Data Integrator collects for a table, output schema, or column. You can choose from these audit functions:

Data object               Function    Description
Table or output schema    Count       Collects two statistics: the Good count for rows that were successfully processed, and the Error count for rows that generated some type of error if you enabled error handling.
Column                    Sum         Sum of the numeric values in the column. Applicable data types include decimal, double, integer, and real. This function only includes the Good rows.
Column                    Average     Average of the numeric values in the column. Applicable data types include decimal, double, integer, and real. This function only includes the Good rows.
Column                    Checksum    Checksum of the values in the column.


The default data type for each audit function and the permissible data types are:

Audit function    Default data type          Allowed data types
Count             INTEGER                    INTEGER
Sum               Type of audited column     INTEGER, DECIMAL, DOUBLE, REAL
Average           Type of audited column     INTEGER, DECIMAL, DOUBLE, REAL
Checksum          VARCHAR(128)               VARCHAR(128)

Audit label names
An audit label represents the unique name in the data flow that Data Integrator generates for the audit statistics collected for each audit function that you define.

You use these labels to define audit rules for the data flow. You can also edit the label names to create a shorter meaningful name or to remove dashes.
Note: Dashes are not allowed in label names.

If the audit point is on a table or output schema, these two labels are generated for the Count audit function:
$Count_objectname
$CountError_objectname

If the audit point is on a column, the audit label is generated with this format:
$auditfunction_objectname
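For example, a Count audit point on the ODS_CUSTOMER source used in the activity later in this lesson generates the first two labels below, and an audit point that uses the Sum function generates a label such as the third (the object names here are illustrative):

$Count_ODS_CUSTOMER
$CountError_ODS_CUSTOMER
$Sum_ORDER_TOTAL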

Audit rules
An audit rule is a Boolean expression which consists of a Left-Hand Side (LHS), a Boolean operator, and a Right-Hand Side (RHS):
• The LHS can be a single audit label, multiple audit labels that form an expression with one or more mathematical operators, or a function with audit labels as parameters.

• The RHS can be a single audit label, multiple audit labels that form an expression with one or more mathematical operators, a function with audit labels as parameters, or a constant.

Note: If you define multiple rules in a data flow, all rules must succeed or the audit fails.



These are examples of Boolean expressions used as audit rules:
$Count_CUSTOMER = $Count_CUSTDW

$Sum_ORDER_US + $Sum_ORDER_EUROPE = $Sum_ORDER_DW

round($Avg_ORDER_TOTAL) >= 10000

Example
Use auditing rules if you want to compare audit statistics for one object against another object. For example, you can use an audit rule to verify that the count of rows from the source table is equal to the count of rows in the target table.

To define audit points and rules in a data flow
1 Right-click a data flow, and select Audit.
2 In the Audit Editor Label tab, right-click a source, transform, or target object you want to audit, and select Properties.
3 In the Schema Properties Editor, on the Audit tab, click the drop-down list beside Audit function.
4 Select the audit function you want to use against this data object type, and click OK.

Note: The audit functions displayed in the drop-down menu depend on the data object type that you have selected.

5 In the Audit Editor, click the Rule tab.
6 Under Auditing Rules, click Add.


Notice that when you click Add, you activate the expression editor and the Custom options become available for use:

The expression editor contains three drop-down lists where you specify the audit labels for the objects you want to audit and choose the Boolean expression to use between these labels.

7 From the first drop-down list in the expression editor, select the audit label for the object you want to audit.

8 Select a Boolean expression.
9 Select the audit label for the second object you want to audit.
10 Click Close.
Note: If you want to compare audit statistics for one or more objects against statistics for multiple other objects or a constant, select the Custom radio button, and click the ellipsis button beside Functions. This opens up the full-size smart editor where you can drag different functions and labels to use for auditing. To access the labels in the smart editor, click the Variables tab.
11 Validate the data flow.
12 In the Designer project area, right-click the job and select Execute.
13 In the Execution Properties window, click the Trace tab.
14 Scroll down to find Trace Audit Data.
15 Highlight Trace Audit Data and set the value to Yes.


The job executes, and the job log displays audit messages based on the audit function used for each audited object.

Defining audit actions on failure

You can choose any combination of the actions listed for notification of an audit failure. If you choose all three actions, Data Integrator executes them in this order:
1 Email to list: Data Integrator sends a notification of which audit rule failed to the email addresses that you list in this option. Use a comma to separate the list of email addresses. You can specify a variable for the email list. This option uses the smtp_to function to send email. Therefore, you must define the server and sender for the Simple Mail Transfer Protocol (SMTP) in the Data Integrator Server Manager.

2 Script: Data Integrator executes the custom script that you create in this option.

3 Raise exception: This action is the default. When you choose the Raise exception option and an audit rule fails, the Job Error Log shows the rule that failed. The job stops at the first audit rule that fails. This is an example of a message in the Job Error Log:
Audit rule failed <($Count_ODS_CUSTOMER = $CountR1)> for <Data flow Case_DF>.

Note: If you clear this action and an audit rule fails, the job completes successfully and the audit does not write messages to the job log. You can view which rule failed in the Auditing Details report in the Metadata Reporting tool. The Metadata Reporting Tool is discussed later in the course.
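As an illustrative sketch of the Script action only, a custom script might simply note the failure in the trace log and in a table; the audit_failure_log table is hypothetical, and getdate() assumes a SQL Server target:

print('Audit rule failed for data flow Case_DF');
sql('Target_DS', 'insert into audit_failure_log (failure_time) values (getdate())');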

To add audit actions on failure
1 Right-click a data flow, and select Audit.
2 Select the object you want to audit.
3 Specify your audit functions and rules.
4 In the Rule tab of the Audit Editor, under Action on Failure, select the action you want.
Note: The email text box and expression editor are activated when you select the Email to list and Script options respectively.



Viewing audit results

You can see the audit status in one of these places:

Action on Failure     Places where you can view audit information
Raise an exception    Job Error Log, Metadata Reports
Email to list         Email message, Metadata Reports
Script                Wherever the custom script sends the audit messages, Metadata Reports

Resolving invalid audit labels

An audit label can become invalid if you delete or rename an object that had an audit point defined on it.

To resolve invalid audit labels
1 Right-click a data flow, and select Audit.
2 Expand the Invalid Labels node to display the individual labels.
3 Note any labels that you would like to define on any new objects in the data flow.
4 After you define a corresponding audit label on a new object, right-click the invalid label, and select Delete.
Tip: If you want to delete all of the invalid labels at once, right-click Invalid Labels, and click Delete All.


Explaining guidelines for choosing audit points

When you choose audit points, consider:
• The output data of an object. The Data Integrator optimizer cannot push down operations after the audit point. Therefore, if the performance of a query that is pushed to the database server is more important than gathering audit statistics from the source, define the first audit point on the query or later in the data flow.
For example, suppose your data flow has source, query, and target objects, and the query has a WHERE clause that is pushed to the database server and significantly reduces the amount of data that returns to Data Integrator. Define the first audit point on the query, rather than on the source, to obtain audit statistics on the query results.

• If a pushdown_sql function is after an audit point, Data Integrator cannot execute it.

• The auditing feature is disabled when you run a job with the debugger.
• If you use the CHECKSUM audit function in a job that normally executes in parallel, Data Integrator disables the Degrees of Parallelism (DOP) for the whole data flow. The order of rows is important for the result of CHECKSUM, and DOP processes the rows in a different order than in the source.

Note: For more information on DOP, see “Using Parallel Execution” and “Maximizing the number of push-down operations” in the Data Integrator Performance Optimization Guide.



Activity: Using auditing in a data flow

Objectives
In this activity you add auditing to Case_DF to verify that all source rows were processed by the Case transform.

Instructions
1 Open Case_DF.
2 Add another target template table and call it R123.
You will use this template table to capture all the rows from the ODS_Customer source.

3 Double-click the Case transform.
4 In the Case Editor, add a label and expression:
• Label: R123
• Expression: ODS_CUSTOMER.REGION_ID in (1,2,3)
5 Click the back arrow to go back to the data flow level.

6 From the Designer toolbar, click the Audit icon.
7 In the Audit Editor, right-click ODS_Customer and select Properties.
8 In the Audit tab of the Schema Properties, select the Count function from the drop-down list.
Tip: You can also right-click ODS_Customer, and select Count from the options listed.
9 Repeat the procedure to add an audit function to R123.
10 Click the Rule tab, and click Add to add an auditing rule.
11 From the expression editor, select the Boolean expression, and the first and second objects you want to audit, so that the rule compares the count of ODS_Customer rows with the count of R123 rows.



12 Click Close.
13 Validate the data flow.
14 In the Designer project area, right-click Case_Job and select Execute.
15 In the Execution Properties window, click the Trace tab.
16 Set Trace Audit Data to Yes.

The job log should contain audit messages showing the row counts collected for ODS_CUSTOMER and R123.

17 Go back to the data flow level.
18 Click the magnifying glass on ODS_CUSTOMER to view data.
19 Click the magnifying glass on R123 to view data.
Your row count for both tables should be 12.
Tip: Do not close the view data window for ODS_CUSTOMER when you view data for R123. This way you can display both data sets side by side.


Lesson Summary

Quiz: Handling Errors and Auditing
1 List the different strategies you can use to avoid duplicate rows of data when re-loading a job.

2 True or False: You can only run a job in recovery mode after the initial run of the job has been set to run with automatic recovery enabled.

3 True or False: Automatic recovery allows a failed job to be re-run starting from the point of failure. Automatic recovery only works if the job is unchanged, thus, you can use it for production, development and test environments.

4 What are the two scripts in a manual recovery work flow used for?

5 What is data flow auditing?

6 What must you define in order to audit a data flow?

• True or False. The auditing feature is disabled when you run a job with the debugger.



7 Which setting has the highest precedence in recovery mode?
• Recover as a Unit
• Enable Recovery
• Execute Only Once
• Recover from Last Execution

Summary

After completing this lesson, you are now able to:
• List levels of data recovery strategies
• Recover a failed job using automatic recovery and marking recovery units for executing jobs
• Run a job in recovery mode when automatic recovery fails
• Create a manual, recoverable work flow using status tables
• Use try/catch blocks to specify alternate work flow options in the event of job errors
• Process data with problems
• Define audit points and rules
• Define audit actions on failures
• View audit results
• Explain guidelines for choosing audit points
• Resolve invalid audit labels


Lesson 11
Supporting a Multi-user Environment

Data Integrator supports multi-user development of an application. You can work in your own local repository while sharing and storing work in a central repository with others.

In this lesson you will learn about:
• Working in a multi-user environment
• Setting up a multi-user environment
• Describing common tasks

Duration: 1 hour


Working in a multi-user environment

Introduction

The term multi-user is used to refer to environments where multiple developers use local and central repositories while working on interdependent parts of a given project.

After completing this unit, you will be able to:
• Explain the stages of Data Integrator's development process in a multi-user environment
• Describe terminology used in a multi-user environment
• Explain repository types in a multi-user environment

Explaining the Data Integrator development process

The development process you use to create your Extract, Transform, and Load (ETL) application involves three distinct phases: design, test, and production. Each phase might require a different computer in a different environment and different security settings. For example, the design and initial test might only require limited sample data and low security, while the final testing might require a full emulation of the production environment including strict security.

To control the environmental differences, each phase might require a different repository. Objects created in the design environment must be moved to the testing environment, and finally to the production environment in a controlled way.

Designing
In this phase, you define objects and build diagrams that instruct Data Integrator in your data movement requirements. Data Integrator stores these specifications so you can reuse them or modify them as your system evolves.

Design your project with migration to testing and final production in mind. Consider these basic guidelines as you design your project:
• Construct design steps as independent, testable modules
• Use meaningful names for each step you construct
• Make independent modules that can be used repeatedly to handle common operations
• Use test data that reflects all the variations in your production data



Testing
In this phase, you use Data Integrator to test the execution of your application. At this point, you can test for errors and trace the flow of execution without exposing production data to any risk. If you discover errors during this phase, return the application to the design phase for correction, then test the corrected application.

Testing has two parts:
• The first part includes designing the data movement using your local repository.
• The second part includes fully emulating your production environment, including data volume.

Data Integrator provides feedback through trace, error, and statistics logs during both parts of this phase.

The testing repository should emulate your production environment as closely as possible, including scheduling jobs rather than manually starting them.

Production
In this phase, you set up a schedule in Data Integrator to run your application as a job. Evaluate results from production runs and, when necessary, return to the design phase to optimize performance and refine your target requirements.

After moving a Data Integrator application into production, monitor it in the Administrator for performance and results.

During production:
• Monitor your jobs and the time it takes for them to finish.

The trace and monitoring logs provide information about each job as well as the work flows and data flows contained within the job. You can customize the log details. However, the more information you request in the logs, the longer the job runs. Balance job run-time against the information necessary to analyze job performance. For more information see the Data Integrator Performance Optimization Guide.

• Check the accuracy of your data.

Working in a multi-user environment requires you to share work with others. In Data Integrator you share work with others through the import/export or checking in and checking out of objects amongst many other object related tasks. The next section discusses the terms that are used in a Data Integrator multi-user environment.

[Figure: the three phases of the development process: the design phase defines data movement requirements, the test phase does not expose production data, and the production phase uses results to refine jobs.]


Describing terminology used in a multi-user environment

A multi-user environment affects how you use Data Integrator and how you manage different phases of an application. For success in a multi-user environment, you must maintain consistency between your local repository and the central repository.

The following terms apply when discussing multi-user environments and Data Integrator:
• Highest level object
The highest level object is the object that is not a dependent of any object in the object hierarchy. For example, if Job1 is comprised of WF1 and DF1, then Job1 is the highest level object.
• Object dependents
Object dependents are objects associated beneath the highest level object in the hierarchy. For example, if Job1 is comprised of WF1, which contains DF1, then both WF1 and DF1 are dependents of Job1. Further, DF1 is a dependent of WF1.
• Object version
An object version is an instance of an object. A version of the object is created each time you add or check an object into the central repository. The latest version of an object is the last or most recent version created.

When working in a multi-user environment, you activate the link between your local repository and the corresponding central repository each time you log in. To ensure that your repository is current, you can get or copy the latest version of each object in the central repository. Once you get an application in your local repository, you can view and run it from the Designer.

[Figure: you log on to the local repository, activate the connection to the central repository, get objects from the central repository, and then view and run them from the local repository.]


Explaining repository types in a multi-user environment

Data Integrator allows you to create a central repository for storing the master copy of a Data Integrator application. The central repository contains all information normally found in a repository, such as definitions for each object in an application.

However, the central repository is merely a storage location for this information. To change the information, you must work in a local repository.

A local repository provides a view of the central repository. Local repositories are used for:
• Developing objects
• Running jobs, because Job Servers always work with a local repository

Central repositories are used for:
• Storing code
• Sharing code with other developers
Note: Never log into a central repository. The central repository will act as if it were a local repository if you log into it, and you run the risk of corrupting version information. If you attempt to log into the central repository, the Data Integrator Designer presents a warning message. When you see this message, log out immediately and log back into a local repository.

You can also maintain multiple connections to central repositories.

Local and central repositories can be on the same machine's hard drive or on separate hard drives.

You can get or copy objects from the central repository to your local repository. To make changes, check out an object from the central repository to your local repository. While you have an object checked out from the central repository, other users cannot change the information.

When done, you check in the changed object. When you check in objects, Data Integrator saves the new, modified objects in the central repository.

You can easily maintain your repositories by following these guidelines:
• Keep earlier versions accessible in a central repository only
• Compact active repositories regularly for maximum efficiency
• Use database permissions to limit who can compact a central repository
• Do not overwrite anything in a repository: a version of an object may be replaced as the most current version, but each new version is stored as a separate record in a repository table.

[Figure: you get or check out objects from the central repository into the local repository, and check changed objects back in to the central repository.]


Setting up a multi-user environment

Introduction

To support a multi-user environment, you must configure the environment and set up several repositories.

After completing this unit, you will be able to:
• Create a central repository
• Define a connection to the central repository
• Activate the central repository

Creating a central repository

The central repository stores master information for the development environment. First, create a central repository, define a connection to the repository, and then activate the repository for use in a multi-user development environment.

To create a central repository
1 In your DBMS, create a new database to use for the central repository.
2 Click the Start button, and then point to Programs. Point to the folder that contains the Data Integrator Repository Manager, and then click Repository Manager.
3 In the Repository Manager window, select your database type from the database list.
4 In the Database server name: and Database name: fields, type the database server name and database name associated with the database you just created.
5 In the User name: and Password: fields, type the user name and password associated with the database you just created.
6 Under Repository type: select the Central check box.
7 Click Create.

Data Integrator creates repository tables in the database you identified.

A team working on an application only needs one central repository. However, each team member requires a local repository. Each local repository also requires connection information to any central repository it must access.

A local repository was already created for you by your instructor for the purpose of this course. You use the same procedure as above, but select a local repository type when you create a local repository. For more information on creating a local repository, see "Creating a Data Integrator Repository", Chapter 2 in the Data Integrator
Note: The version of the central repository must match the version of the local repository.



Defining a connection to a central repository

To define a connection to a central repository
1 Start the Data Integrator Designer and log into your local repository.
2 On the Tools menu, click Central Repositories.
The Central Repository Connections option is selected by default in the Designer Options list.
3 Click Add. The Central Repository window appears.
4 In the Name box, type a name to identify your central repository.
5 In the Database Type list, select the appropriate database type for your central repository.
6 In the Database server name: and Database name: fields, type the database server name and database name for your central repository.
7 In the Database server version: list, select the version of the database you are running.
8 In the User name: and Password: fields, type the user name and password you associated with your central repository.
9 Click OK.

The list of central repository datastores now includes the newly connected central repository.

Activating a central repository

Now that you have connected your local repository to a central repository, you must also activate the link between them. You must activate the central repository each time you log in.
Note: Make sure you activate the correct central repository if you are working with more than one central repository.

To activate a central repository
1 In the Designer, on the Tools menu, select Central Repositories.
2 In the Central Repository Connections list, select a central repository to make active.
3 Click Activate.
Data Integrator activates the link between your local repository and the selected central repository. The Central object library icon on the Designer toolbar is also activated.
Note: The Reactivate automatically option allows you to reactivate the central repository automatically the next time you log into your local repository.

4 On the Designer toolbar, click the Central object library icon to display all the objects available in the central repository and the check-out status of each object.



Connection information about the activated central repository appears in the upper right corner of the Central object library.

You can also make changes to your connection to the central repository or delete the connection.

To edit connections in a central repository

• On the Central Object Library toolbar, click the Edit icon.

To delete connections to a central repository
1 In the Designer, on the Tools menu, select Central Repositories.
2 In the Central Repository Connections list, select the central repository from which you want to delete the connection, and click Deactivate.
3 Click OK.
4 Right-click the central repository and select Delete.
After confirming your selection, the connection information from this local repository is deleted. You can no longer connect to that central repository from this local repository.
Note: You are not deleting the central repository; you are only deleting the connection information between your local repository and this central repository.


Activity: Creating a central repository

Objectives
In this activity you will:
• Create a central repository
• Create two local repositories
• Define connections to the central repository

Instructions
In these exercises, two developers use a job to collect data for their company's HR department. Each developer has a local repository and they both share a central repository.
1 Add the three new databases listed below using MS SQL Server to set up a multi-user environment:

Database Name    User name    Password
central          central      central
user1            user1        user1
user2            user2        user2

2 Associate the users to the corresponding database and assign these rights to the Administrator: the Permit in Database Role rights are db_owner, default, and public.
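If you prefer to script this setup rather than create the databases and users interactively, T-SQL along these lines creates one of the three databases with its login and role membership. This is a sketch only, assumes SQL Server 2005 or later syntax, and would be repeated for user1 and user2:

create database central;
create login central with password = 'central', check_policy = off;
use central;
create user central for login central;
exec sp_addrolemember 'db_owner', 'central';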

3 Create a central repository in the Data Integrator Repository Manager by selecting the Central option for repository type. Use the database name, user name and password you just created for the central database.

4 Create two local repositories in the Data Integrator Repository Manager by selecting the Local option for repository type. Use the database names, user names and passwords you just created for databases user1 and user2.

5 Start the Data Integrator Designer and log in as user1 with password user1.

6 From the Designer Tools menu, add a new central repository connection.
7 In the Name field, enter CentralRepo.
Note: Connection names cannot have spaces.
8 In the Database Type list, select the appropriate database type for your central repository.
9 Complete the appropriate login information for your database type.
10 Enter Central in both the User Name and Password fields.
11 Activate the central repository.



The list of central repository datastores now includes the newly connected central repository.

12 Exit Data Integrator.
13 Start the Data Integrator Designer and log in as user2 with password user2.
14 Establish a connection between the user2 Local object library and the central repository.
15 Close Data Integrator.


Describing common tasks

Introduction

When you work in a multi-user environment, you complete several tasks on a regular basis.

After completing this unit, you will be able to:
• Add objects to the central repository
• Add objects with filtering to the central repository
• Check out objects
• Undo a checkout
• Check in objects
• Get objects
• Label objects
• Compare objects
• View object history
• Delete objects

Adding objects

You can add objects from your local repository to a central repository once your connections have been set up and your central repository is activated.

You must use your local repository for the initial creation of any objects in an application. After the initial creation of an object, you can add it to the central repository. Once in the central repository, the object is subject to version control and can be shared among users.

You can only add new objects to the central repository. The central repository will not allow you to add objects that already exist in the Central object library.

You can add a single object to the central repository, or you can add an object with all of its dependents to the central repository. Dependents are objects used by another object, for example, data flows that are called from within a work flow.



To add single objects or objects with dependents
1 In the Designer, on the Tools menu, select Central Repositories.
2 In the Central Repository Connections list, select a central repository to make active.
3 Click Activate to activate your Central object library.
4 From the Designer Local object library, right-click the object and select Add to Central Repository.
5 Do one of the following:
• From the drop-down list, select Object.
• From the drop-down list, select Object and dependents.

6 In the Comments window, type comments regarding the object you are adding.

7 Click OK.
The Output dialog box notifies you that the object has been added successfully to the Central object library. You can view objects you added to the Central object library by clicking the different object tabs.


Activity: Importing and adding objects in the central repository

Objectives
In this activity you will:
• Import objects into your local repository
• Activate a connection to the central repository
• Add objects to the central repository

Instructions
To complete this activity successfully, you use the local, central, and user1 repositories that you created in an earlier exercise for this lesson.
1 Start Data Integrator Designer and log into your local repository as user1 with password user1.
2 In the user1 Local object library Jobs tab, right-click in the open area, select Repository and Import From File.
3 Navigate in your Data Integrator install directory in /Tutorial Files.
4 Open the Multiusertutorial.atl file.
Note: Ignore any warnings and continue. You should see the following work flows and data flows added to your user1 Local object library:

Tab               Object names
Job               Job_Employee
Work flow         WF_EmpPos
                  WF_PosHireDate
Batch data flow   DF_EmpDept
                  DF_EmpLoc
                  DF_PosHireDate

5 Activate the connection to the central repository called CentralRepo.
6 Add JOB_Employee and its dependents to the central repository.
7 Add the comment: Adding JOB_Employee to the central repository.

Data Integrator adds the object and its dependents to the active central repository. The Output window also opens so that you can view the results of this task.

Now that the central repository contains all objects in the user1 local repository, developers can check these objects in and out.



Adding objects with filtering to the central repository

During the development process you can filter objects so that the information resources referenced are appropriate for the phase of the project. Filtering these objects is helpful because these objects can contain repository-specific information.

For example, datastores and database tables might refer to a database connection unique to a user or a phase of development. When multiple users work on an application, they can change repository-specific information.

Filtering allows you to customize your objects by selectively changing environment-specific information in object definitions. Using filtering you can:
• Change datastore and database connection information
• Change the root directory for files associated with a particular file format
• Select or clear specific dependent objects

The filtering process is available when adding, checking in, checking out, or getting the latest objects in a central repository.

To filter an object
1 In the Designer, on the Tools menu, select Central Repositories.
2 In the Central Repository Connections list, select a central repository to make active.
3 Click Activate to activate your Central object library.
4 From the Local object library, right-click an object, select Add to Central Repository, and then select With filtering from the drop-down list.
The Version Control Confirmation window groups the objects by type and displays all the dependent objects, the destination status of the objects, and the planned action.

5 In the Version Control Confirmation window, expand a group and select the object you want to filter or exclude.



6 In the Target Status: list, select the status of the target. You can choose to create, exclude, or replace the target.
7 Click Next.
The Datastore Options window appears. This window shows the dependent datastores and tables for the selected object. From here you can include or exclude datastores and tables.
Note: This window opens only if the objects that you are adding, checking in, or checking out include a datastore.
8 Click Finish.
9 In the Comments window, type comments regarding the object you are filtering.
The Output dialog box notifies you that the object has been successfully added with filtering to the active central repository.

Checking out objects

You can check out objects that you expect to change from the Central object library. You can also check out objects that you want to change in an application. Whenever you check out objects, you copy the objects from the active central repository to your local repository.

When you check out an object, you make that object unavailable to other users; other users can view the object but cannot make changes to the object. Checking out an object ensures that two users do not make conflicting changes to the object simultaneously.

Data Integrator changes the object icons in both the Local and Central object libraries to indicate that the object is checked out.

When an object is checked out, your Central object library shows you the local repository that has checked out the object. You can determine which user is working with the checked out object by looking at the repository name listed for the checked out object.

You can check out:
• Single objects
For example, you can check out a work flow and change the work flow by adding a new script to it. However, you cannot change the work flow's dependent objects, such as a data flow, and retain those changes in the central repository. Changes to dependent objects are only retained in the local repository.
• Single objects with their dependents
For example, you can check out a work flow and make changes to the work flow or any of its dependents and retain the changes in both central and local repositories.
• Single objects or objects with dependents without replacement
The object definition from the central repository replaces existing definitions for that object in your local repository when you check out the object. You use this check out method if you want to leave intact the current object version in your local repository.
For example, suppose you are working in your local repository and you make a change to an object that is not checked out. If you determine that the change improves the design or performance of your application, you will want to include that change in the central repository.
Note: Use caution when checking out objects without replacing the version in your local repository. When you do not replace the version in your local repository, you can lose changes that others have incorporated into those objects.
• Objects with filtering
When you check out an object with filtering, the object and all its dependents are checked out. When you check out objects with filtering, you always replace local versions with the filtered objects from the central repository. Filtering allows you to:
• Change datastore and database connection information.
• Change the root directory for files associated with a particular file format.
• Select or clear specific dependent objects.

To check out single objects or objects with dependents
1 Activate your central repository.
2 From the toolbar, click the Central object library icon, right-click an object, and select Check Out.
3 Do one of the following:
• From the drop-down list, select Object.
• From the drop-down list, select Object and dependents.
Data Integrator copies the most recent version of the selected object from the central repository to your local repository, then marks the object as checked out.

The Output dialog box lets you know that you have successfully checked out the object.

To check out an object without replacement
1 Activate your central repository.
2 From the toolbar, click the Central object library icon, right-click an object, and select Check Out.
3 From the drop-down list, select Object without replacement.
The Output dialog box lets you know that you have successfully checked out the object without replacement.


To check out an object with dependent objects without replacement
1 Activate your central repository.
2 From the toolbar, click the Central object library icon, right-click an object, and select Check Out.
3 From the drop-down list box, select Object and dependents without replacement.
Note: You cannot check out the datastores for an object when using this checkout option. You must check out with filtering if you want to include the datastores in your checkout.
4 In the Warning message box, click Yes if you do not want to include the datastores for the object you are checking out.
The Output dialog box lets you know that you have successfully checked out the object with its dependents without replacement.

To check out objects with filtering
1 Activate your central repository.

2 Click to access the Central object library, right-click an object, and select Check Out.

3 From the drop-down list box, select With filtering.
The Version Control Confirmation window groups the objects by type and displays all the dependent objects, the destination status of the objects, and the planned action.

4 In the Version Control Confirmation window, expand a group and select the object you want to filter or exclude during check out.

5 In the Target Status: list, select the status of the target.
6 Click Next.

The Datastore Options window appears. This window shows the dependent datastores and tables for the selected object. From here you can include or exclude datastores and tables.
Note: This window only opens if the objects that you are adding, checking in, or checking out include a datastore.

7 Click Finish.


Occasionally, you may decide that you did not need to check out an object because you made no changes. You may also decide that the changes you made to a checked out object are not useful and you prefer to leave the master copy of the object as is. In these cases, you can undo the checkout.

When you undo a checkout, you leave both the object in your local repository and the object in the central repository as is; no changes are made and no additional version is saved in the central repository. Only the object status changes from checked out to available.

After you undo a checkout, other users can check out and make changes to the object. You can undo a checkout on a single object or on an object with its dependents.

To undo a check out

1 Click to access the Central object library, right-click an object, and select Undo Check Out.

2 Do one of the following:
• From the drop-down list, select Object.
• From the drop-down list, select Object and dependents.

The Output dialog box lets you know that you have successfully undone the checkout of the object or of the object and its dependents.

After you finish making changes to checked out objects, you must check them back into the central repository. Checking in objects creates a new version in the central repository and allows others to get the changes that you have made.

Checking in objects also preserves a copy of the changes for revision control purposes. You can get a particular version of a checked in object and compare it to subsequent changes or even revert to the previous version.

Check in an object when you are done making changes, when others need the object that contains your changes, or when you want to preserve a copy of the object in its present state.

You can check in:
• Single objects or objects with dependents

Just as you can check out a single object or an object with all dependent objects, you can check in a single object or an object with all checked-out dependent objects (as calculated in the local repository).

• Objects with filtering
Just as you can check out objects with filtering, you can check in objects with filtering. When you check in an object with filtering, the object and all its dependent objects are checked in.
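As a rough mental model of check-in, the sketch below assumes the central repository keeps a list of versions per object, so checking in appends a new version instead of overwriting the old one. This is an illustration of the versioning idea only, not the product's internal design.

# Conceptual model of check-in -- not the Data Integrator API.
central_versions = {"DF_EmpLoc": ["version 1", "version 2"]}  # history kept by the central repository
local = {"DF_EmpLoc": "version 2 plus my mapping changes"}
checked_out = {"DF_EmpLoc"}

def check_in(name, comment):
    """Append the local definition as a new version and release the check-out."""
    central_versions[name].append(local[name])
    checked_out.discard(name)
    print(f"Checked in {name} as version {len(central_versions[name])}: {comment}")

check_in("DF_EmpLoc", "Removed FName and LName columns from POSLOC target table.")
# The full history remains, so earlier versions can still be retrieved or compared.
print(central_versions["DF_EmpLoc"])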

Undoing check outs

Checking in objects


To check in single objects or objects with dependents

1 Activate your central repository.

2 Click to access the Central object library, right-click an object, and select Check In.

3 Do one of the following:
• From the drop-down list, select Object.
• From the drop-down list, select Object and dependents.

4 In the Comments window, type comments regarding the object you are checking in.

5 Click OK.

To check in objects with filtering
1 Activate your central repository.

2 Click to access the Central object library, right-click an object, and select Check In.

3 From the drop-down list, select With filtering.
4 In the Version Control Confirmation window, expand a group and select the object you want to filter or exclude during check in.
5 In the Target Status: list, select the status of the target.
6 Click Next.

The Datastore Options window appears. This window shows the dependent datastores and tables for the selected object. From here you can include or exclude datastores and tables.
Note: This window only opens if the objects that you are adding, checking in, or checking out include a datastore.

7 Click Finish.
8 In the Comments window, type comments regarding the object you are checking in.
9 Click OK.

Working in a multi-user environment requires you to make changes to objects when needed. You may also want to refresh the central object library to see if any changes have been made to other objects in the central repository while you were working with other objects.

To refresh the Central object library

• From the Central object library toolbar, click to refresh.


Activity: Checking in and out objects from the central repository

Objectives
In this activity you will:
• Check out and check in objects from the central repository
• Undo checkouts from the central repository

Instructions
To complete this activity successfully, you need to log on as user1 in the central repository.
1 Log into the Designer as user1 and activate the central repository.
2 In the Central object library, check out the WF_EmpPos object and dependents.
3 Open the DF_EmpDept data flow.
4 In the DF_EmpDept workspace window, click the Query transform name to open the query editor.
5 Change the mapping by right-clicking FName in the target schema and selecting Cut.

6 Open the DF_EmpLoc data flow.
7 In the DF_EmpLoc workspace window, click the Query transform name to open the query editor.
8 Change the mapping by right-clicking FName in the target schema and selecting Cut.
9 Do the same for LName in the target schema.
10 Open the Central object library and check in the DF_EmpLoc data flow.
11 Add the comments: Removed FName and LName columns from POSLOC target table.
12 From the Central object library window, right-click DF_EmpLoc and select Show History. Notice that a version log of this object has been saved.

Practice


13 Close the History window.
14 In the Central object library, right-click the DF_PosHireDate data flow and check out the data flow object.
15 Open the DF_PosHireDate data flow.
16 In the DF_PosHireDate workspace window, click the Query transform name to open the query editor.
17 Change the mapping. Right-click LName in the target schema, and select Cut.
18 Return to the data flow and save your work.
19 Now you realize you modified the wrong data flow. In the Central object library, undo the checkout for the DF_PosHireDate object. In the Central object library, open DF_PosHireDate and you should see that no changes have been made to it in the central repository.

20 Log out of the Designer.


Activity: Checking out objects with filtering

Objectives
In this activity you will check out an object with filtering.

Instructions
To complete this activity successfully, you need to log on as user2 in the central repository.
1 Log into the Designer as user2 and activate the central repository.
2 Open the Central object library.
3 Check out the WF_EmpPos object with filtering.

The Version Control Confirmation window opens and displays all dependent objects, destination status of the objects, and planned action. Objects are grouped by object type.

4 Expand the list for each object type and view particular objects.
5 Exclude the NamePos_Format object from the Target Status field so that this object is not checked out.
6 Exclude the POSLOC table from being checked out.
7 Finish checking out WF_EmpPos.
8 Verify the filter changes by opening the DF_EmpDept data flow from the user2 Local object library.
9 Notice that the NamePos_Format file format is not checked out.

Tip: Checked out objects have a red check mark.

Practice


Getting an object copies the latest version of that object from the Central object library into your local repository. This replaces the existing version of the object in your local repository.

When you get an object, you do not check out the object. The object remains free for others to check out and change. You can get an object with or without dependent objects and filtering. For information about getting earlier versions of objects or objects with particular labels, see the “View object history” section in this lesson.

To get a single object or an object with its dependents
1 Activate your central repository.

2 Click to access the Central object library, right-click an object, and select Get Latest Version.

3 Do one of the following:
• From the drop-down list, select Object.
• From the drop-down list, select Object and dependents.

To get an object and its dependent objects with filtering
1 Activate your central repository.

2 Click to access the Central object library, right-click an object, and select Get Latest Version.

3 From the drop-down list, select With filtering.
4 In the Version Control Confirmation window, expand a group and select the object you want to get.
5 In the Target Status: list, select the status of the target.
6 Click Next.
The Datastore Options window appears. This window shows the dependent datastores and tables for the selected object. From here you can include or exclude datastores and tables.
Note: This window only opens if the objects that you are adding, checking in, or checking out include a datastore.
7 Click Finish.
8 In the Comments window, type comments regarding the object you are getting.
9 Click OK.

Getting objects


Labeling objects helps you organize and track the status of objects in your application. When you label an object, the object and all its dependent objects are labeled. A label not only describes an object, but also allows you to maintain relationships between various versions of objects.

For example, a job is added to the central repository where user1 works on a work flow in the job, and user2 works on a data flow in the same job. At the end of the week the job is labeled End of week one status.

By the end of week one:
• User1 has checked in two versions of the work flow into the central repository
• User2 has checked in four versions of the data flow into the central repository

The End of week one status label contains version one of the job, version two of the work flow, and version four of the data flow.

In the following weeks, user1 and user2 continue to change their respective work flow and data flow as shown below:

After a few changes, you want to get the job with the versions of the work flow and data flow that carry the End of week one status label. You can accomplish this by getting the job by its label.

The label End of week one status serves the purpose of collecting the versions of the work flow and data flow that were checked in at the end of the week. Without this label, you would have to get a particular version of each object to reassemble the collection of objects labeled End of week one status.

Labeling objects

Figure: object versions in the central repository. The job, work flow, and data flow are added on Monday as version 1. Later check-ins create work flow versions 2 and 3 and data flow versions 2 through 5. The label "End of week one status" marks job version 1, work flow version 2, and data flow version 4, so getting the job and its dependent objects by that label retrieves exactly those versions.
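A label can be thought of as a named snapshot of the version numbers that were current when the label was applied. The sketch below illustrates that idea with invented object names; it is not how the repository actually stores labels.

# Labels as snapshots of object versions -- a conceptual illustration only.
versions = {"Job_Sales": 1, "WF_weekly": 2, "DF_monthly": 4}  # current version of each object
labels = {}

def label(name, current_versions):
    """Record which version of each object the label refers to."""
    labels[name] = dict(current_versions)

def get_by_label(name):
    """Return the set of object versions collected under the label."""
    return labels[name]

label("End of week one status", versions)

# Development continues and the current versions move on...
versions["WF_weekly"] = 3
versions["DF_monthly"] = 5

# ...but getting by label still reassembles the older, consistent set of versions.
print(get_by_label("End of week one status"))
# {'Job_Sales': 1, 'WF_weekly': 2, 'DF_monthly': 4}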


To label a single object or an object with its dependents
1 Activate your central repository.

2 Click to access the Central object library, right-click an object, and select Label Latest Version.

3 Do one of the following:
• From the drop-down list, select Object.
• From the drop-down list, select Object and dependents.

4 In the Label Latest Version window, type the text that describes the current status of the object in the Label: box, and click OK.
The Output dialog box lets you know that you have successfully created a label for the object or the object with its dependents.

To get an object by its label
1 Activate your central repository.

2 Click to access the Central object library, right-click an object, and select Show History.

3 In the History window, click the version of the object with the label you want.

4 Click Get Obj By Label.
The Output dialog box lets you know that you have successfully retrieved the object by its label.


You can compare any two objects and their properties in both the Local and Central object libraries by using the new Difference Viewer utility.

The Difference Viewer utility allows you to compare:
• Two different objects.
• Different versions of the same object.
• An object in the Local object library with its counterpart in the Central object library.

You can compare top-level objects, or you can include the object’s dependents in the comparison. However, the objects must be of the same type. For example, you can compare a job to another job or a custom function to another custom function. However, you cannot compare a job to a data flow.

For a multi-user environment you can access the Difference Viewer from the Central object library by using the Show History option. The Show History option is discussed in more detail in the next section “View object history” of this lesson.

To display the Difference Viewer
• In the Designer View menu, point to Toolbars and select Difference Viewer.

The Difference Viewer displays the objects and their dependents, a toolbar, a navigation bar, and a status bar.

Comparing objects


Toolbar
Using the toolbar, you can navigate, filter, and show levels between the objects being compared. Each option is represented by a toolbar button:
• The differences found in the compared objects are highlighted in the Difference Viewer. Use the navigation buttons to move between the First, Previous, Current, Next, and Last differences.
• Use the filtering buttons to enable and disable filtering applied to the objects being compared. When enabling filtering, you can also select to:
• Hide nonexecutable elements, to remove from view those elements that do not affect job execution.
• Hide identical elements, to remove from view those elements that do not have differences.
• Use the Show Level options to show the levels of the objects selected for the comparison. Show Level 1 shows only the objects you selected for comparison, Show Level 2 expands to the next level, and so on. Show All Levels expands all levels of both trees.
• One option opens a text search window.
• One option opens the currently active Difference Viewer in a separate window. You must close this window before continuing in Data Integrator.


Navigation bar
The vertical navigation bar contains colored bars that represent each of the differences throughout the comparison.

The colors correspond to those in the status bar for each difference. An arrow in the navigation bar indicates the difference that is currently highlighted in the panes.

You can click the navigation bar to select a difference; the cursor point will have a star on it.

The purple brackets in the bar indicate the portion of the comparison that is currently in view in the panes.

Status bar
The status bar includes a key that illustrates the color scheme and icons used to identify the different states within each result displayed for the compared objects in the Difference Viewer.

A description of these states is provided below:
• Deleted: the item does not appear in the object in the right pane.
• Changed: the differences between the items are highlighted in blue (the default) text.
• Inserted: the item has been added to the object in the right pane.
• Consolidated: this icon appears next to an item if items within it have differences. You can expand the item by clicking its plus sign to view the differences.

The status bar also displays which difference is currently selected in the comparison.
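The four states map naturally onto the outcomes of a generic sequence comparison. The sketch below uses Python's difflib to classify the differences between two hypothetical object definitions; it is an analogy for the classification, not the Difference Viewer's actual implementation.

# Classifying differences between two hypothetical object definitions -- an analogy only.
from difflib import SequenceMatcher

old_object = ["name=POSLOC", "column FName varchar(20)", "column LName varchar(20)", "column City varchar(20)"]
new_object = ["name=POSLOC", "column LName varchar(30)", "column City varchar(20)", "column Country varchar(20)"]

state_names = {"delete": "Deleted", "replace": "Changed", "insert": "Inserted"}

for tag, i1, i2, j1, j2 in SequenceMatcher(None, old_object, new_object).get_opcodes():
    if tag == "equal":
        continue  # identical elements can be skipped, like the Hide identical elements filter
    print(state_names[tag], old_object[i1:i2], "->", new_object[j1:j2])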


To compare two different objects in the Difference Viewer
1 In the Local object library, right-click an object, and point to Compare.
2 Do one of the following:

• Click Objects to Central to compare the selected object to its counterpart in the Central object library.

• Click Object with dependents to Central to compare the selected object and its dependent objects to its counterpart in the Central object library.

• Click Object to... to compare the selected object to another similar type of object.

• Click Object with dependents to... to compare the selected object and its dependents to another similar type of object.
The cursor changes to a target icon.

3 Move the cursor over the comparison object.
The target cursor changes color when it moves over an object that is eligible for comparison.

4 Click the object you want to compare.
The Difference Viewer displays in the workspace. Changed items are identified with a combination of icons, color, and background shading. Depending on the object type, the panes display items such as the object’s properties and the properties of, and connections (links) between, its child objects.
Tip: You can have multiple Difference Viewer windows open at a time in the workspace. To refresh a Difference Viewer window, press F5.


The central repository retains a history of all changes made to objects in the central repository. You can use this history to help manage and control development in your application.

You can use this option to examine the history of an object, to get a previous version of an object, and to get an object with a particular label.

If you are working in a multi-user environment and using a Central object library, you can compare two objects that have different versions or labels.

To view object history
1 Activate your central repository.

2 Click to access the Central object library, right-click an object and select Show History.

The History window displays version, label, repository, date, and action (the type of change a user made to the object) and comment information about each revision of the object.

3 Do one of the following:
• Select an object and click Get Obj By Version to get a previous version of an object.
• Select an object and click Get Obj By Label to get an object by its label.
• Ctrl-click to select two objects and click Show Differences or Show Differences with Dependents to compare two objects that have different versions or labels.
The Difference Viewer window displays both objects, with their dependents if selected, for viewing.

Viewing object history


You can delete objects from either the central repository or a local repository.

Deleting an object from your local repository does not simultaneously delete it from the central repository. Similarly, when you delete an object from a central repository, you do not delete the object from the connected local repositories.

Deleting objects from the central repository does not delete any dependent objects. When you delete objects from the central repository, you only delete the selected object and all versions of the selected object.

To delete objects
1 Do one of the following:

• In your local repository, from the Local object library, select the object you want to delete.

• In the active central repository, from Central object library, select the object you want to delete.

2 Right-click the object and select Delete.

Deleting objects


Activity: Comparing objects using the Difference Viewer

Objectives
In this activity you will:
• Modify New_Functions_Job
• Compare the modified version with the original version in the central repository

Instructions
1 Activate the central repository from the Designer Tools menu.
2 Right-click New_Functions_Job and add the job with filtering to the central repository.
3 Click Next to accept the default options for the Version Control number window.
4 Click Finish to accept the default options for the Datastore options windows.
5 Type My New Functions Job in the Add - With filtering comments box and click Apply to All.
The Output dialog box should indicate that all your objects have been added.
6 Open the Central object library from the Designer toolbar.
You should see the New_Functions_Job under the Jobs tab of the Central object library.

Practice


7 Keeping the Central object library open, go back to your Local object library.

Note: The Central object library snaps into place in your Designer when you go back to your Local object library:

8 In your Local object library, modify New_Functions_Job by adding a new CITY column to the Query Schema Out in the Query Editor.

9 From the Designer View menu, click Toolbars to launch the Difference Viewer.
The Difference Viewer tool palette displays vertically on the far right-hand side of the Designer window.

10 From the Central object library, compare the New_Functions_Job with dependents to the Local object library.
The Difference Viewer appears in the Designer workspace. The first object you selected appears in the left pane of the window, and the second object appears in the right pane. Following each object name is its location.
Tip: You can have multiple Difference Viewer windows open at a time in the workspace. To refresh a Difference Viewer window, press F5.
11 Hide nonexecutable elements and identical elements by selecting the Diff filters icon from the Difference Viewer tool palette on the right side of the Designer window.


The identical elements should now be hidden.

12 Browse to the Table label to see the change you made to the Table object when you inserted the CITY column.
The CITY object should be highlighted in green.

Tip: If you know the hierarchy level of the object you modified, you can use the different Show Level options from the Differences Viewer tool palette on the right.


Lesson Summary

Quiz: Supporting a Multi-user Environment
1 True or False. Data Integrator provides feedback through trace, error, and statistics logs during both parts of the design phase.

2 What are dependent objects?

3 True or False. You will run across these terms when working in a Data Integrator multi-user team environment:
• Highest level object
• Object dependents
• Object version

4 Which repository do you use to create and change objects for an application?

5 What must you do if you want to make changes to an object that resides in the Central object library?

6 True or False. Logging in a Central Repository is not recommended.

Review


After completing this lesson, you are now able to:
• Explain the stages of Data Integrator’s development process in a multi-user environment
• Describe terminology used in a multi-user environment
• Explain repository types in a multi-user environment
• List repository maintenance guidelines
• Create a central repository
• Define a connection to the central repository
• Activate the central repository
• Add objects to the central repository
• Add objects with filtering to the central repository
• Check out objects
• Undo a checkout
• Check in objects
• Get objects
• Label objects
• Compare objects
• View object history
• Delete objects

Summary


Lesson 12
Migrating Projects

After you create your projects you can move them from a development and testing environment to a production environment.

In this lesson you will learn about:
• Understanding migration tools
• Using datastore configurations and migration
• Migrating a single-user environment without a central repository
• Migrating a multi-user environment

Duration: 1 hour


Understanding migration tools

Data Integrator offers different tools that you can use to migrate your projects from the design, testing, and production phases. The migration tool you choose depends on the scope of your project.

After completing this unit, you will be able to:
• Prepare for migration
• Describe migration mechanisms and tools
• Choose a migration mechanism

You can help facilitate the migration process between development phases by implementing standardized naming conventions for external data sources, directory locations, schema structures, and owners.

External data sources
While the actual data you are extracting, transforming, and loading usually differs by database, the essential structure of the data should be the same for every database on which you want the same applications to work.

Using generic naming for similar external datastore connections reduces the time to reconfigure the connections to the same database type between project phases.

For example, naming each connection according to each phase (Test_DW, Dev_DW and Prod_DW) requires you to reconfigure each datastore connection including user name, password, and host string names against the Test, Dev, or Prod instance.

Instead, you could call the connection string DW and then point it to the different databases in the development, test, or production phases. The datastore connection then runs, regardless of which phase you run the job against, without users having to edit the datastore properties.
Note: When you use this generic, cross-phase naming method, you cannot access both the development and test connections from the same computer, because the connection string maps to only one instance.
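One way to picture the generic naming convention is as a per-phase lookup that resolves the single logical name DW to a different physical connection in each environment. The sketch below is purely illustrative; the host, database, and user values are invented, and the resolve function is not Data Integrator functionality.

# Hypothetical per-phase resolution of one logical datastore name ("DW").
# In Data Integrator this mapping lives in each environment's datastore definition;
# the code below only illustrates the naming convention.
CONNECTIONS = {
    "development": {"DW": {"host": "devdb01", "database": "DW_DEV", "user": "dw_dev"}},
    "test":        {"DW": {"host": "testdb01", "database": "DW_TEST", "user": "dw_test"}},
    "production":  {"DW": {"host": "proddb01", "database": "DW", "user": "dw_prod"}},
}

def resolve_datastore(name, phase):
    """Return the physical connection for a logical datastore name in a given phase."""
    return CONNECTIONS[phase][name]

# Jobs always refer to "DW"; only the environment decides where it points.
print(resolve_datastore("DW", "test"))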

Directory locations
It is recommended that you use logical directory names or point to common local drives to standardize directory locations. For example, use X:\ as a logical directory, or, since every computer has a C: drive, point to C:\Temp as a safe, reproducible standard.

Introduction

Preparing for migration


Schema structures and owners
To further facilitate a seamless structure between development phases, give all your database instances the same owner name for the same schema structures from which you are reading and to which you are loading.

Regardless of name, the owner of each schema structure can vary and Data Integrator will reconcile them.

In addition to using standard naming conventions, you can also use datastore configurations within your datastore connections.

Datastore configurations allow you to create multiple connections within each datastore so you can easily move jobs from development to test and production servers. Datastore configurations also support aliases and owner names for database objects. This feature is covered in more detail in the next section of this lesson.


The migration mechanism and tools you use will depend on the needs of your development environment. Data Integrator provides two ways to move jobs and dependent objects between repositories:
• Multi-user migration via central repositories
This method provides source control for multi-user development and saves old versions of jobs and dependent objects. Multi-user migration works best in larger projects where two or more developers or multiple teams are working on interdependent parts of Data Integrator applications throughout all phases of development.
• Single-user export and import migration
You use a wizard to help you map objects between source and target repositories. Using export and import, you can point to information resources to be remapped between repository environments. There is no central repository used with this method. This method works best with small to medium-sized projects where a single developer or a small number of developers work on somewhat independent applications through all phases of development.

Although Data Integrator supports a multi-user environment, you may not need to implement this architecture on all projects.

If your project is small to medium in size and only consists of one or two developers, then a central repository may not be a necessary solution to integrating the work of those developers.

For example, only two Data Integration Consultants work on ABC project.

During the development phase of this project, consultant A manages the master repository, and consultant B works on a new section within a complete copy of the master repository from their local repository.

Consultant B then exports this new section back into the master repository using the export method. Consultant A upgrades the master repository and takes a new complete copy of the master repository overwriting the previous copy.

You can use the following matrix to help you determine which mechanism and tools would work best in your environment.

Describing migration mechanisms and tools

Choosing a migration mechanism

Migration Guideline Matrix

The matrix compares each situation or requirement against the migration mechanisms (export/import and multi-user) and the tools (naming conventions and configurations), marking for each row the optimal solution and the compatible solutions. The situations covered are:
• Need a “fast and easy” migration solution
• Small to medium-sized project
• Multiple-team project
• Source data from multiple homogeneous systems


Using datastore configurations and migration

Data Integrator XI introduces datastore configurations to improve on the datastore profiles feature in previous versions.

The datastore configuration feature allows you to consolidate separate datastore connections for similar sources or targets in a job into one source or target datastore with multiple configurations. You can then select the set of configurations that includes the sources and targets you want by selecting a system configuration when you execute or schedule the job.

After completing this unit, you will be able to:
• Create multiple configurations in a datastore
• Use the Rename Owner tool to rename database objects
• Use the Rename Owner tool in a multi-user environment
• Create a system configuration

Multiple datastore configurations enable you to create multiple connections within each datastore so you can easily move jobs from development to test and production servers.

You can use multiple configurations to minimize your effort in migrating existing jobs from one database type and version to another. When you add a new configuration, Data Integrator modifies the language of data flows that contain table targets and SQL transforms in the datastore. It adds the target options and SQL transform text to the new database type or version based on the target options and SQL transform text that you defined for an existing database type or version.

This functionality performs in the following ways:
• If the new configuration has the same database type and version as the old configuration, nothing happens. This is because the targets set in the Target Table Editor and SQL transforms already have the properties set for the correct database type and version.

• If the new configuration has the same database type but a higher database version than the old configuration, Data Integrator copies all the properties of the targets and SQL transforms from the old configuration to the new configuration. For example, if a target T1 has the auto-correct option set to TRUE for the old configuration's database type or version, the same target T1 will have the auto-correct option set to TRUE for the new configuration's database type or version.

• If the new configuration has the same database type but a lower database version than the old configuration, Data Integrator copies all the properties except the bulk loader properties from the old configuration to the new configuration. The same situation applies if the new configuration has a different database type from the old configuration.
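The three cases above amount to a small decision rule. The following sketch restates that rule as a plain Python function; it is a paraphrase of the documented behavior under the assumption that versions can be compared numerically, not code taken from the product.

# A paraphrase of how target and SQL transform properties are carried over to a
# new datastore configuration -- illustrative only, not Data Integrator code.
def properties_for_new_configuration(old, new, old_properties):
    """Decide which properties are copied from the old configuration to the new one."""
    if new["db_type"] == old["db_type"] and new["db_version"] == old["db_version"]:
        return {}  # nothing to do: the existing settings already apply
    if new["db_type"] == old["db_type"] and new["db_version"] > old["db_version"]:
        return dict(old_properties)  # copy everything, including bulk loader options
    # Lower version, or a different database type: copy everything except bulk loader options.
    return {k: v for k, v in old_properties.items() if not k.startswith("bulk_loader")}

old_cfg = {"db_type": "Oracle", "db_version": 9}
new_cfg = {"db_type": "Oracle", "db_version": 10}
props = {"auto_correct_load": True, "bulk_loader_enabled": True}
print(properties_for_new_configuration(old_cfg, new_cfg, props))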

Introduction

Creating multiple configurations in a datastore


To create multiple datastore configurations in an existing datastore
1 In the Datastore tab of the Local object library, right-click a datastore, and select Edit.
2 In the Edit Datastore window, click Advanced.
A grid of additional datastore properties and the multiple configuration controls displays.
3 Click Edit.
The functionality of the Configurations window is an extension of functionality within the Datastore Editor. When you open this dialog, you will always see at least one configuration that reflects the values from the Datastore Editor. This configuration is the default configuration.
The window shows the datastore configurations and properties in a grid. Each column represents a database configuration and each row represents a datastore property.


If a property does not apply to a configuration, the cell displays N/A in gray and does not accept input. Cells that correspond to a group header also do not accept input, and are marked with hashed gray lines.

4 From the toolbar, click to create a new configuration and type the name for your new configuration.
Note: Do not include spaces when assigning names for your datastore configurations.
5 In the Create New Configuration window, complete the connection, general, and local information.
Note: Different properties appear depending on the selected datastore type and (if applicable) database type and version.
6 Click Aliases (Click here to create) to create a new alias.

Note: When you create a datastore configuration, you first create an alias for the datastore and then define the owner name that the alias name maps to. You can create any number of aliases for a datastore and then map datastore configurations to them. When you import metadata for objects using this configuration, the datastore configuration’s alias is automatically substituted for the real owner names.After you create an alias, such as ALIAS1, you can then navigate to each configuration and define the real owner name that the alias maps to. Also note that the owner names are not labeled in the configuration window.

When you delete an alias name, the operation applies to the datastore (all configurations). Data Integrator removes the selected row.


7 Click OK.
A second configuration is added to the Configurations window.
8 Click Apply to save your new configuration.

Configurations toolbar
Using the Configurations window toolbar icon options, you can also perform these actions:

Each toolbar icon lets you:
• Create a new configuration.
• Duplicate an existing configuration (select the configuration, then click the icon).
• Rename an existing configuration (select the configuration, then click the icon).
• Delete an existing configuration (select the configuration, then click the icon).
• Sort configurations in ascending order.
• Sort configurations in descending order.
• Move the default configuration to the first column in the list.
• Create a new alias for the datastore. To map individual configurations to an alias, enter the real owner name of the configuration in the grid.
• Select an existing alias and delete it.
• Expand all categories.
• Collapse all categories.
• Show or hide details.


After you save your new configuration, Data Integrator copies the existing SQL transform and target table editor values, and displays a report of the modified objects in the Output window.

This report can be used as a guide to manually change the properties of targets and SQL transforms, as needed. The report contains information on:
• Names of the data flows where the language was modified
• Objects in the data flows that were affected
• Types of the objects affected (table target or SQL transform)
• Usage of the objects (source or target)
• Objects that have a bulk loader, when bulk loading options are used
• Confirmation of whether the bulk loader option was copied, when bulk loading options are used



Data Integrator allows you to rename the owner of tables, template tables, or functions using the owner renaming feature.

The owner renaming feature enables you to easily replace real owner names with a datastore configuration alias category for existing table and function metadata. Consolidating metadata under a single datastore configuration alias name allows you to access accurate and consistent dependency information at any time; while also allowing you to switch between configurations when you move jobs to different environments.

When you use objects stored in a central repository, a shared alias makes it easy to track objects checked in by multiple users. If all the users of the multiple local repositories in your team use the same alias, Data Integrator can track dependencies for objects that your team checks in and out of the central repository.

When you create a datastore configuration, you first create an alias for the datastore and then define the owner name that the alias name maps to. You can create any number of aliases for a datastore and then map datastore configurations to them. When you import metadata for objects using this configuration, the datastore configuration’s alias is automatically substituted for the real owner names.
Note: Owner renaming affects the instances of a table or function in a data flow, not the datastore from which they were imported.

After you change the owner name of an existing datastore table or function, the repository does not store the original owner name. You can then map the new owner name to the original owner name so that Data Integrator will process the correct object at runtime.
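Conceptually, an alias adds a level of indirection between the owner name used inside data flows and the real owner name in each configuration. The sketch below shows that indirection; the alias, table, and owner names are examples only (the owners reuse names from the course databases), and the lookup is not the repository's actual data model.

# Illustrative alias-to-owner mapping per datastore configuration (example names).
aliases = {"ALIAS1": {"TEST": "DEMO_ODSu", "PRODUCTION": "ODSuser"}}

def real_owner(alias, configuration):
    """Resolve the owner name that a data flow's table reference uses at run time."""
    return aliases[alias][configuration]

# A data flow refers to ALIAS1.EMPLOYEE (a hypothetical table); the active
# configuration decides which real owner name is used.
print(real_owner("ALIAS1", "TEST"))        # DEMO_ODSu
print(real_owner("ALIAS1", "PRODUCTION"))  # ODSuser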

Data Integrator supports both case-sensitive and non-case-sensitive owner renaming:
• If the objects in your default datastore configuration are from a case-sensitive database, the owner renaming tool preserves case sensitivity.
• If the objects within the same datastore with multiple configurations come from both case-sensitive and non-case-sensitive databases, Data Integrator takes the case sensitivity used by the default datastore configuration into account when renaming the new owner names.
Tip: To ensure that all objects are portable across all configurations in this scenario, use all uppercase names for all owner names and object names.

Using the Rename Owner tool to rename database objects


To rename the owner of an object
1 In the Datastore tab of the Local object library, expand one of your datastores.
2 Expand and right-click a table, template table, or function category.
3 Select Rename Owner.
4 In the New Owner Name: field, type in the new owner name.
5 Click Rename.

Note: If the object with the new owner name already exists in the datastore, the Designer will determine if the object has the same schema as the object to be renamed. If they are the same, then the Designer allows the rename to proceed. If they are different, then the Designer displays a message to that effect.

During the owner renaming process, Data Integrator updates:
• The dependent objects (jobs, work flows, and data flows that use the renamed object) to use the new owner name.
• The Central object library, to display the object with the new owner name.
• All the dependent objects, and deletes the metadata for the object with the original owner name from the object library and the repository.

There are several possible behaviors that you should be aware of when you use the rename tool and are checking objects in and out of a central repository.

The possible scenarios depend on the check-out state of the renamed object and whether there are any dependent objects that refer to the renamed object.

These scenarios include:
• The object is not checked out, and the object has no dependent objects in the local or central repository.
In this scenario, the Designer renames the object owner when you use the rename tool.
• The object is checked out, and the object has no dependent objects in the local or central repository.
In this scenario, the Designer renames the object owner when you use the rename tool.
• The object is not checked out, and the object has one or more dependent objects in the local repository.

Using the Rename Owner tool in a multi-user environment


In this scenario, the Designer displays a second dialog that lists the dependent objects that use or refer to the renamed object.

If you continue, the Designer renames the objects and modifies the dependent objects to refer to the renamed object using the new owner name. If you cancel, no changes are made.
Note: An object might still have one or more dependent objects in the central repository. However, if the object to be renamed is not checked out, the Rename Owner feature, by design, does not affect those dependent objects in the central repository.

• The object is checked out and has one or more dependent objects. This is the most complicated scenario.
When you use the rename tool in this scenario, a second dialog pops up to display the dependent objects and a status indicating their check-out state and location.

• If the dependent object is in only the local repository, the status displayed is Used only in local repository- No check out necessary.

• If the dependent object is in the central repository — regardless of whether the parent object also exists in the local repository— and is not checked out by anybody, the status displayed is Not checked out.

• If the dependent object is already checked out by you, or by another user, the status displays the name of the checked out repository; for example, Microsoft SQL Server.production.user1.

For any of the situations encountered when you use the rename tool to check out an object that has one or more dependents, use the Refresh List option in the Rename owner window to update the check out status in the list. For example, refreshing the check out status list is useful when a dependent object in the central repository has been identified, but another user has it checked out. When that user checks in the dependent object, refreshing the list shows that the dependent object is no longer checked out.


Tip: In order to use the Rename Owner feature to its best advantage, check out the dependent objects from the central repository. This helps to avoid having dependent objects referring to objects with nonexistent owner names.

• When you check out all dependent objects and use the rename tool, the Designer renames the owner of the selected object and modifies all dependent objects to refer to the new owner name. Although it may look like the original object was modified to have a new owner name, in reality the Designer has not modified the original object. Instead the Designer has created a new object identical to the original, but uses the new owner name. The original object with the old owner name still exists. The Designer then performs an undo checkout on the original object. It then becomes your responsibility to check in the renamed object.
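In terms of the simple repository model used in the earlier sketches, this rename behaves like a copy-and-redirect rather than an in-place edit. The fragment below paraphrases the behavior described above with invented object and owner names; it is not the Designer's implementation.

# Paraphrase of renaming an owner when the object and its dependents are checked out.
# Object, owner, and table names are invented; this is not the Designer's implementation.
local = {"OLD_OWNER.EMPLOYEE": "table metadata", "DF_EmpDept": "uses OLD_OWNER.EMPLOYEE"}
checked_out = {"OLD_OWNER.EMPLOYEE", "DF_EmpDept"}

def rename_owner(old_name, new_owner, dependents):
    table = old_name.split(".")[1]
    new_name = f"{new_owner}.{table}"
    local[new_name] = local[old_name]      # a new, identical object under the new owner name
    for dep in dependents:                 # dependents now refer to the new owner name
        local[dep] = local[dep].replace(old_name, new_name)
    checked_out.discard(old_name)          # the Designer undoes the check-out on the original
    checked_out.add(new_name)              # checking in the renamed object is your responsibility
    # The original object, under the old owner name, still exists in the local repository.

rename_owner("OLD_OWNER.EMPLOYEE", "NEW_OWNER", dependents=["DF_EmpDept"])
print(sorted(checked_out))  # ['DF_EmpDept', 'NEW_OWNER.EMPLOYEE']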



Activity: Creating multiple datastore configurations

Objective
In this activity you will:
• Create two datastore configurations: one connecting to the Test source demo database, and one connecting to the Production source database
• Use the multiple configurations to change the datastore connection to point to the production database

Instructions
1 In the Datastores tab of the Designer Local object library, right-click and select New.
2 In the Create New Datastore window, create a new datastore named Demo_ODS_DS that points to the DEMO_ODS database.
Note: The database server name should be localhost. If it is different, check the machine name in MS SQL Enterprise Manager. Use DEMO_ODSu for the user name. No password is required.
3 Click Advanced, and then click Edit.
4 Double-click the Configuration1 heading, and change Configuration1 to TEST.

5 From the Configurations for Datastore toolbar, click to create a new datastore configuration.

Practice


6 In the Create New Configuration window, in the Name field, type PRODUCTION.

7 Leave the remaining default values, and click OK.
The TEST and PRODUCTION datastore configurations should be displayed side by side.

8 In the configuration settings for the PRODUCTION configuration, modify the:
• Server name to be your computer name, if different.

You can check your server/machine name in MS SQL Enterprise Manager.

• Database name to be ODS_DS. Notice that your TEST configuration points to DEMO_ODS, but you want your PRODUCTION configuration to point to ODS_DS. This setup simulates a Test to Production connection environment.

• User name to be ODSuser with no password.
This is the user name that was used during the ODS database setup, as indicated in the Facilities setup instructions for this course.

Tip: You can go between the TEST and PRODUCTION configurations by double-clicking each configuration header, or by selecting the configuration from the drop-down menu in the Go to: area in the toolbar.

9 Click OK to save and close the configuration changes.
10 Click OK to save and close the datastore changes.

To switch between the TEST and PRODUCTION datastore configurations
• Right-click Demo_ODS_DS, and select Edit.
• Click Advanced, and then click Edit again.


• Under the PRODUCTION configuration settings, click the Default configuration option, and from the drop-down list, select Yes.

Note: Notice that the default configuration value for TEST automatically changes to No.

11 Click OK twice to exit the Configurations for Datastore and Edit Datastore windows.
To switch back to connect to the TEST data source, change the default configuration value under TEST to Yes.


System configurations define a set of datastore configurations that you want to use together when running a job. In many organizations, a job designer defines the required datastore and system configurations, and a system administrator determines which system configuration to use when scheduling or starting a job in the Data Integrator Administrator. You learn about the Administrator in Lesson 13.

When designing jobs, determine and create datastore configurations and system configurations depending on your business environment and rules. Create datastore configurations for the datastores in your repository before you create the system configurations for them.

Data Integrator maintains system configurations separately. You cannot check in or check out system configurations. However, you can export system configurations to a separate flat file, which you can later import. By maintaining system configurations in a separate file, you avoid modifying your datastore each time you import or export a job, or each time you check in and check out the datastore.
Note: You cannot define a system configuration if your repository does not contain at least one datastore with multiple configurations.
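A system configuration is essentially a named mapping from each multi-configuration datastore to the datastore configuration it should use for a run. The sketch below models that mapping and its export to a flat file; the JSON format is chosen only for the illustration and is not the file format Data Integrator uses.

# Illustrative model of system configurations -- not Data Integrator's file format.
import json

system_configurations = {
    "SC_Test": {"Demo_ODS_DS": "TEST"},
    "SC_Production": {"Demo_ODS_DS": "PRODUCTION"},
}

# Selecting a system configuration at execution time picks one datastore
# configuration per multi-configuration datastore.
selected = system_configurations["SC_Production"]
print(selected["Demo_ODS_DS"])  # PRODUCTION

# Because system configurations are maintained separately from jobs and datastores,
# they can be exported to a file and imported into another environment.
with open("system_configurations.json", "w") as handle:
    json.dump(system_configurations, handle, indent=2)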

To create a system configuration
1 In the Designer, click Tools and select System Configurations.

The System Configuration editor displays columns for each datastore that has multiple configurations:

2 Under the Configuration name column, in the text box, type the configuration name.
Tip: Use the SC_ prefix in the system configuration name so that you can easily identify this file as a system configuration, particularly when exporting.

3 For each available datastore column, click the text box, and from the drop-down menu select the appropriate datastore configuration that you want to use when you run a job using this system configuration.

4 Click OK.

Creating a system configuration


Migrating a multi-user environment

Using this method you can provide source control and save old versions of jobs and dependent objects.

After completing this unit, you will be able to:
• Distinguish between a phased and a versioned model for multi-user migration
• Add a project to a central repository
• Get the latest version of a project
• Update the project
• Copy contents between central repositories

The multi-user migration method works best in larger projects where two or more developers or multiple teams are working on interdependent parts of Data Integrator applications throughout all phases of development.

Typically, applications go through different phases on the way from development to production. For example, an application might go through three phases:
• Developers creating an application
• Testers validating the application
• Administrators running the application

A single, central repository can support your application through these phases. You can use projects and job labeling to maintain application components independently for each phase. In addition, filtering of database owners and file locations allows you to configure the application to run in each local environment.

In some situations, you may require more than one central repository for application phase management. If you choose to support multiple central repositories you can use a single local repository as a staging location for the transition.

Data Integrator offers project support for migration projects between local and central repositories. Jobs consisting of objects and dependents are grouped into projects in the Central object library of the central repository for source control.

During the migration process you can:
• Add projects to the central repository
• Get the latest version of a project from the central repository
• Update a project in the central repository
• Copy contents between central repositories

Introduction

Distinguishing between a phased and a versioned migration


Depending on the size of your project you can choose between a phased or a versioned model.

Phased models are good for large, multi-user development environments, where multiple developers work on interdependent parts of a project.

The phased model uses more than one central repository during different development phases. For example, you can use one central repository for development/test purposes and one central repository for test/production purposes.

Note: Before you run the job in the production environment, label the job with dependencies and the current date. This way, you can always revert back to the last version of the job in the production environment by getting the label by date. Make sure you also keep track of the labels manually because Data Integrator does not offer a report or log option for object labels.


A versioned model is better for smaller projects where developers work on interdependent or independent parts of a project.

In a versioned model, you use only one central repository for testing purposes and then move the central repository into the production environment. Keeping track of all changes made to jobs by different developers within a single repository, even if you are using labels, can become complicated in a large environment.


You can add any project and its dependent objects to the central repository. Only new objects are added because previously existing dependent objects are not overwritten in the central repository.

To add a project to the central repository
1 From the Data Integrator Designer menu bar, click Tools and select Central Repositories to activate your central repository.
2 In the Designer Local object library, under the Project tab, right-click the project, and select Add to Central Repository.
3 Do one of the following:
• From the drop-down list, select Object and dependents.
Note: When you select Object and dependents, you can apply comments to one or all of the objects that you are adding to the central repository.
• From the drop-down list, select With filtering.

Getting the latest version of a project from the central repository copies the project definition and its associated jobs to your local repository.

For example, in your local repository you create a project Proj_Sales to group all the jobs that are used to extract data from a sales system. You also add the following jobs to Proj_Sales:
• Job_Sales_Monthly, which you will use for monthly extractions
• Job_Sales_Weekly, which you will use for weekly extractions

You then add Proj_Sales with all its objects and dependents to the central repository so that others can access your project.

Your fellow co-developer checks out Job_Sales_Monthly from the central repository while you check out Job_Sales_Weekly. Both of you modify the jobs and check them back into the central repository. To keep track of the changes made to Proj_Sales, you can get the latest version of the project to see the changes you and other co-developers have made.

To get the latest version of a project
1 Activate your central repository.
2 In the Central object library, under the Project tab, right-click the project, and select Get Latest Version.
3 From the drop-down list, select Object and dependents.

Adding a project to a central repository

Getting the latest version of a project


Updating a project

You can update the project definition in the Central object library to include any new jobs that you have added to the project in your local repository. Updating the project definition in the central repository adds new jobs but does not overwrite existing jobs in the central repository.

In the previous example, you created Proj_Sales with jobs to track monthly and weekly sales. You now want to add a job to track daily sales to the project. You add Job_Sales_Daily to Proj_Sales from your local repository.

At the same time, a couple of your co-developers have checked out Job_Sales_Monthly and Job_Sales_Weekly, modified them, and checked them back into the central repository.

You do not want to overwrite the monthly and weekly jobs that your co-developers have modified in the central repository with the versions that you have in your local repository. With the update option, you can update Proj_Sales in the central repository to include the new job from your local repository without overwriting the other jobs in the central repository.

To update a project
•  In the Designer Local object library, under the Project tab, right-click the project, and select Update.

Copying content from central repositories

You cannot directly copy the contents of one central repository to another central repository. Instead, you must use your local repository as an intermediate repository.

To copy the contents of one central repository to another central repository
1  In your local repository, activate the central repository from which you want to copy the contents.
2  In your local repository, get the latest version of the contents from the central repository you just activated.
3  In your local repository, activate the central repository to which you want to copy the contents.
4  Add the content from your local repository to the central repository.

Sometimes you may need to recopy the contents of one central repository into another. For example, you will need to recopy contents if a part of a job in its testing phase is reassigned to a redesign phase.



To recopy contents from one central repository to another
1  In your local repository, check out specific objects without replacement from the second central repository.
2  In your local repository, get the latest version of the objects from the first central repository.
3  In your local repository, check in the updated objects to the second central repository.


Migrating a single-user environment

Introduction

This basic migration method allows you to remap information resources between repository environments without using a central repository.

After completing this unit, you will be able to:
• Import and export objects to a repository
• Export objects to a file
• Export a repository to a file

Importing and exporting objects to a repository

Export and import is a basic migration method that works best with small to medium-sized projects where a single developer or a small number of developers work on somewhat independent applications through all phases of development.

In a single-user environment there is no need for a central repository. In this environment you can export and import jobs and dependent objects directly from one local repository to another. You can also export all metadata attributes associated with selected objects.

When you export a job from a development repository to a production repository, you can change the properties of objects being exported to match your production environment.

First, you export jobs from the local repository to either a file or a database, and then you import them into another local repository. For example, when moving from a design repository to a test repository, you export from your design repository and import the file or database to your test repository.

You can export back and forth between the design and test repositories to test and rectify application errors.



During your initial export you export jobs with all dependent objects and change properties of datastores and file formats to point to the correct information resources, if necessary.

In subsequent exports you export jobs with all dependent objects and exclude datastores and unchanged file formats to avoid overwriting those already defined in the destination repository.
Note: When you export objects to a flat file that is located in another repository, you must be able to connect to and have write permission for that repository, and your repository versions must match.

You can export objects from the current repository to another repository provided that both repositories are running the same Data Integrator product version.

The export process allows you to change environment-specific information defined in datastores and file formats to match the new environment.

For example, you can change datastore configurations (application and database locations and login information) to reflect production sources and targets.

When you export an object, the datastores, file formats, and custom functions included in the object definition are also automatically exported.
Note: Use caution when exporting objects directly to another repository as this may overwrite the objects in the other repository.

The Export editor displays the objects, datastores, formats and user functions available for export.

You can refine the object that you are exporting by selecting the options available for each object in the Export editor. These options may vary depending on the object type.


All the available options are listed below:

Exclude
   The Exclude option removes only the selected object from the list of objects to be exported. The object remains in the list, but its exclusion is indicated by a red x on the object icon. All occurrences of the object are excluded.

Include
   The Include option adds an excluded object to the export plan. The red x on the icon disappears. All occurrences of the object are included.
   When you export, the included objects are copied to the destination.

Exclude tree
   The Exclude tree option removes the selected object and all objects called by this object from the export. The objects remain in the list, but their exclusion is indicated by a red x on the icons. The selected object and any objects it calls are excluded.
   When you export the list, the excluded objects are not copied to the destination.

Include tree
   The Include tree option adds the selected excluded object and the objects it calls to the export list. The red x on the selected object and dependents disappears.
   When you export the list, the included objects are copied to the destination.

Exclude environmental information
   The Exclude environmental information option removes all connections (datastores and formats) and their dependent content (tables, files, functions) from the objects in the Export editor. Using this option you can export jobs without connections so that you avoid connection errors. It is recommended that you configure datastores and formats for the new environment separately.
   When you export, excluded objects are not copied to the destination.

Include environmental information
   The Include environmental information option adds all connections (datastores and formats) and their dependent content (tables, files, functions) to the objects you want to export.

Clear All
   The Clear All option removes all objects from all sections of the editor.

Delete
   The Delete option removes the selected object and objects it calls from the Export editor. Only the selected occurrence is deleted; if any of the affected objects appear in another place in the export plan, the objects are still exported.
   This option is available only at the top level. You cannot delete other objects; you can only exclude them.

Export
   The Export option starts the export process.


To export objects to a repository
1  In the Designer Local object library, right-click the object and select Export.
   The Export editor appears on the right-side pane of the Designer.
   Note: The Export editor displays the objects that you have exported in the respective areas based on object type: datastores, formats, or user functions.
2  Right-click each object, and select Export.
3  In the Export Destination window, add the destination database connection information.
4  Click Next.
5  In the Export Confirmation box, click Next to export all objects.
   Note: Use the Export Confirmation box to select specific objects you want to export by clicking the object itself. The Destination status column shows the status of each object in the target database and the proposed action. If the object that you are exporting already exists in the target repository, you can select to replace it or exclude it from your export. If the object that you are exporting does not exist in the target repository, you can select to create the new object in the target repository.
6  In the Datastore Export Options window, click Next.
7  If applicable, in the File Format Mapping window, select a file and change the destination root path.
   Note: You can change the destination root path for any file formats to match the new destination.
8  Click Finish.

Data Integrator copies objects in the Export editor to the target destination. The Output dialog box shows the number of objects exported.



To change datastore connections when exporting objects to a repository
1  In the Export editor, right-click an object and select Export.
2  In the Export Destination window, add the destination database connection information.
3  Click Next.
4  In the Datastore Export Options window, select a datastore and click Advanced.
5  In the Export Datastore window, click Advanced.
   The datastore Configurations window displays.
6  Click Edit to expand the Configurations for datastore window.
7  In the Configurations for datastore window, under the Go to drop-down menu, select the configuration you want this datastore to export with.
8  Click OK.
9  In the Export Datastore window, click OK.
10 In the Datastore Export Options window, click Next.
11 If applicable, in the File Format Mapping window, select a file and change the destination root path.
   Note: You can change the destination root path for any file formats to match the new destination.
12 Click Finish.

Data Integrator copies objects in the Export editor to the target destination. The Output dialog box shows the number of objects exported.


Exporting objects to a file

You can also export objects to a file such as an ATL (Acta Transformation Language) file. Exporting objects to a file allows you to send this file to someone via email. However, when exporting to a file, you are not able to change environment-specific information using this option.

Objects in a repository are exported in the Data Integrator scripting language format (.atl), while whole repositories can be exported in either the .atl or .xml format.
Tip: Using the .xml file format might make repository content easier for you to read. It also allows you to export Data Integrator metadata to other products.

To export objects to a file
1  In the Designer Local object library, right-click the object and select Export.
   The Export editor appears on the right-hand side of the Designer.
   Note: The Export editor displays the objects that you have exported in the respective areas based on object type: datastores, formats, or user functions.
2  In the appropriate object area of the Export editor, right-click each object, and select Export.
3  In the Export Destination window, add the destination database connection information, and select the Export to file: check box.
4  Click Browse to change the destination location for the file.
5  Click Finish.

The Output dialog shows the number of exported objects.



Activity: Exporting objects to a file

Objective
In this activity you export a custom function from a repository to a file.

Instructions
1  Log into the Data Integrator Designer direpo repository using direpouser as your user name.
2  From the Local object library, right-click the Northwind_Count custom function, and export the object.
   The Northwind_Count custom function displays in the User functions to export area of the Export editor.
3  In the Export editor, under the User functions to export area, right-click Northwind_Count and export it.
4  In the Export Destination window, select Export to a file and browse to a location to save the exported file.
   After you complete the export to a file, Data Integrator confirms that the functions have been exported in the Output dialog box.



Activity: Exporting objects to a repository

Objective
In this activity you will:
• Export a custom function from one repository to another repository
• Re-export the exact same custom function and see what options you have for exporting objects that already exist in the target repository

To complete this activity successfully, you need to use the user1 repository that you created at the end of Lesson 11 “Supporting a multi-user environment”.

Instructions
1  Log in as direpouser in the direpo repository, and export the Northwind_Count custom function from the Local object library.
2  Under the User functions to export area, right-click Northwind_Count, and select Export.
3  Complete the Export Destination box with the database connection information for the user1 local repository.
4  Click Next.


Because you are exporting to a database (the user1 local repository), the Destination status column should indicate that the object does not exist in the target repository.

5 Click Finish to complete the export. The Output dialog box indicates that the object is exported successfully.

6  Now try to re-export the object. You should see a warning message indicating that the object already exists in the target repository.
   In this case, you can click the function and either exclude the object from the export or replace the destination object with the same name.

7  In the Target status drop-down list, select Replace.
8  Click Finish to complete the export process.

The Output dialog box indicates that one object is exported.


Activity: Exporting selected objects from one repository to another

Objective
In this activity you will:
• Export the Case_Job to the Export editor
• Select to migrate only the Query transformation and the CUSTOMERS source table from the Case_DF data flow contained within the Case_Job

This activity builds from the last two exercises. To complete this activity successfully, you need to use the user1 and user2 repositories that you created at the end of Lesson 11 “Supporting a multi-user environment”.

Instructions
1  Log in as direpouser in the direpo repository, and export the Case_Job from the Local object library.
   The job and its dependent objects appear on the right-hand side of the Export editor. In addition, you should also see a list of datastores under the Datastores to export area in the Export editor.
2  From the Datastores to export area, exclude the target tables from the export.
   Note: The Exclude option only excludes the object selected. However, since you want to exclude all the target tables, select Exclude Tree. The Exclude tree option is available only under the Datastores to export area of the Export editor.


The Objects to export area in the Export editor now reflects the excluded target tables.

You should also see the two custom functions we exported from the previous activities in the User functions to export area in the Export editor.

3 Right-click each function and select to exclude or delete the function from this export since we have already exported the two custom functions.

4  Right-click Case_Job to export it.
5  In the Export Confirmation window, expand the Datastores category. You should only see the source datastore listed since we excluded the target datastore from the export earlier.

6  Click Next to display the Datastore Export Options window and click Finish to complete the export.
7  Log out of the Data Integrator Designer and log back in using the user1 repository and user name.
   The user1 repository Local object library should display the two functions that we exported before. You should also see Case_Job and the Northwind datastore that you just exported.


Exporting and importing a repository to a file

You can also export and import an entire repository to a file.

Exporting a repository to a file allows you to send this file to someone via email. After it is received, the exported repository file can then be imported into a different repository by another user.

Exporting a repository to a file is also useful for backing up repositories.

You automatically export or import all the jobs and their respective schedules when you export or import a repository to a file. However, you cannot change a job's environment-specific information using this option.

Importing objects or an entire repository from a file also overwrites existing objects with the same names in the destination repository.
Note: Schedules cannot be exported or imported without an associated job and its repository.

To export a repository to a file
1  Right-click in the Designer Object Library space, and select Repository.
2  From the drop-down list, select Export To File.
3  In the Write Repository Export File window, specify the destination of the export file and type the name of the export file.
4  In the Save as type: list, select the file type for your export file.
5  Click Save.

The repository is exported to the file.

To import a repository from a file
1  Right-click in the Designer Object Library space, and select Repository.
2  From the drop-down list, select Import From File.
3  Browse to the location of the export file, and select the file you want to import.
4  In the file type list, select the file type of your export file.
5  Click Open.

A warning message displays to let you know that it takes a long time to create new versions of existing objects.

6  Click OK.
   You must restart Data Integrator after the import process completes.



Activity: Exporting a repository to a file

Objective
In this activity you will:
• Export a repository to a file
• Import it into another repository for backup purposes

To complete this activity successfully, you need to use the user2 repository that you created at the end of Lesson 11 “Supporting a multi-user environment”.

Instructions
1  Log out of the Data Integrator Designer, and log back in to the direpo repository.
2  In the Local object library, make sure you can see a list of all your jobs.
3  Right-click anywhere in the Local object library, select Repository and Export to File.
4  Type direpo_export as the name for your export file and browse to a location to save the file.


The export will take some time, depending on how many objects you have. A successful export message displays on the bottom left-hand side of the Data Integrator Designer.

5 Log out of the Data Integrator Designer and log back in as user2 so we can import the repository file into the user2 repository. You should see that the project list in the user2 Local object library Project tab is empty.

6  Right-click anywhere in the Local object library, select Repository, and then select Import From File.

7  Browse to the location where you saved direpo_export.atl and open the file.
   A warning message displays to let you know that you are creating new versions of existing objects.
8  Click OK to continue.
   The imported objects display in the Local object library under the corresponding object tabs.


Lesson Summary

Review

Quiz: Migrating Projects
1  Name the two migration methods available in Data Integrator. What would you need to consider when choosing each method?

2  List the tasks that are associated with a multi-user migration environment.

3 True or False. The export process allows you to change environment-specific information defined in datastores and file formats to match the new environment.

4 True or False. You can only export objects to a file or repository.

5 What is the advantage of using the datastore configurations during migration?

6 True or False. Logging in a Central Repository is not recommended.



Summary

After completing this lesson, you are now able to:
• Prepare for migration
• Describe migration mechanisms and tools
• Choose a migration mechanism
• Create multiple configurations in a datastore
• Use the Rename Owner tool to rename database objects
• Create a system configuration
• Distinguish between a phased and a versioned model for multi-user migration
• Add a project to a central repository
• Get the latest version of a project
• Update the project
• Copy contents between central repositories
• Import and export objects to a repository
• Export objects to a file
• Export a repository to a file



Lesson 13
Using the Administrator

The Data Integrator Administrator is your primary monitoring resource for all jobs designed in the Data Integrator Designer. Using the Administrator you can access many administrative tasks depending on the type of job or application with which you are working.

This lesson describes some of the options available in the Administrator but focuses on administering batch jobs only.

In this lesson you will learn about:
• Using the Administrator
• Managing the Administrator
• Managing batch jobs with the Administrator
• Understanding server group basics
• Understanding central repository security basics

Duration: 1 hour


Using the Administrator

Introduction

Data Integrator provides two administrative interfaces, the Designer and the Administrator. The Designer supports batch job administration in development and testing phases of a project. The Administrator provides central administration for multiple repositories and Job Servers in all phases of a project.

After completing this unit, you will be able to:
• Log into the Administrator
• Describe the Administrator interface

The Data Integrator Administrator is web-based and requires only a web browser at the client end.

The Administrator allows you to:
• Monitor, schedule, and execute batch jobs on multiple Job Servers and repositories from a central location
• Support multiple Job Servers with a single local repository for your production jobs
• Specify which Job Server executes each job for load balancing
• Track when multiple jobs are executed on a single Job Server

Using the Administrator, you can execute, schedule, and support third party schedulers for batch jobs. The Administrator also gives you the ability to view, delete, and set retention periods for log files.

Logging into the Administrator

Use the default user name (admin) and the default password (admin) the first time you log into the Administrator.

If you encounter an error when you log in, check that the Data Integrator Web Server service is running. The Web Server and its services are included with every Administrator installation. If the Web Server service is running, but you still cannot log in, see “Troubleshooting”, Chapter 10 in the Data Integrator Administrator Guide.



To log into the Administrator
1  Click the Start button, and then point to Programs. Point to the folder that contains BusinessObjects Data Integrator and then click Web Administrator.
2  Log into the Data Integrator Administrator using the default user name and password (admin).
   The Administrator home page displays the status of the repositories, Access Servers, and adapters that are connected to the Administrator. The red, green, and yellow icons indicate the overall status of each item based on the jobs, services, and other objects they support.
   Note: Data Integrator Administrator sessions time out after 30 minutes of user inactivity. Log into the Administrator again when this occurs.

Describing the Administrator interface

The layout of the Data Integrator Administrator consists of a window with a navigation tree on the left and pages on the right. You use the navigation tree to access many administrative tasks depending on the type of job or application with which you are working.



The navigation tree is divided into nodes:
• Batch node
  The Batch node in the Administrator displays all the repository connections you have configured with the Administrator. For each repository listed, a status and configuration tab allows you to open pages to work with jobs in the selected repository.
  The Status tab displays the status of executed jobs and job logs filtered by the status interval you specified. The Configuration tab displays job information sorted and filtered by project.
• Real Time node
  The Real Time node is used to work with real-time services.
• Web Services node
  The Web Services node is used to work with web services.
• Adapter Instances node
  The Adapter Instances node is used to configure a connection between Data Integrator and an external application by creating an adapter instance and dependent operations.
  For more information on administering real-time jobs, web services, and adapters, see the Data Integrator Administrator Guide. For more information on real-time jobs and adapters, see the Data Integrator Designer Guide.
• Server Groups node
  The Server Groups node is used to group Job Servers that are associated with the same repository into a server group.
• Central Repositories
  The Central Repositories node includes these nodes:
  • Users and groups: used to add, remove, and configure users and groups for secure object access.
  • Reports: used to generate reports for central repository objects, such as viewing the change history of an object.
• Profiler Repositories node
  Stores information generated by the Data Profiler for determining the quality of your data.
• Management node
  The Management node includes the repositories, users, access servers, status interval, and log retention period nodes.


Managing the Administrator

Introduction

The Management node allows you to configure management features in the Data Integrator Administrator.

After completing this unit, you will be able to:
• View available repositories
• Add a repository
• Add user roles

The Administrator allows you to manage batch jobs.

Viewing available repositories

The repository containing the jobs you want to manage must be available in the Administrator before you can execute, schedule, or monitor any jobs.

To view available repositories in the Administrator
•  In the Administrator navigation tree, expand Management, and click Repositories.

Adding a repository

To add a repository to the Administrator
1  In the Administrator navigation tree, expand Management, and click Repositories.
2  On the List of Repositories page, click Add.
3  On the Add/Edit Repository page, in the Repository Name: field, type the name of the repository you want to add.
4  In the Database Type: list, select the database type.
5  In the remaining information boxes, type the machine name, datasource name, user name, and password for the repository you are adding.
6  Click Test to test the database information you have specified for the repository.
7  Click Apply.

The Administrator validates the repository connection information and displays it on the List of Repositories page.



Adding user roles

You can add two types of user accounts to the Administrator:
• Administrator
  This role provides access to all of the Administrator’s functionality.
• Monitor
  This role provides access to Status pages (to monitor, start, and stop processes) and the Status Interval and Log Retention Period pages. For example, a monitor role can abort batch jobs but cannot execute or schedule them. A monitor role can restart, abort, or shut down an Access Server, service, adapter instance, or client interface but not add or remove them.

Note: Accounts you create for the Administrator role also control access to the Metadata Reports tool. Repositories connected to the Administrator are also available to Metadata Reports users. The Metadata Report Tool is covered in Lesson 16 of this course.

To add a user to the Administrator
1  In the Administrator navigation tree, expand Management and click Users.
2  On the User Management page, click Add.
3  On the Add/Edit User page, in the Login name: and Password: fields, type the user’s login information.
   Note: Login names and passwords for the Administrator do not need to match those for your system or repository.
4  In the User Name: field, enter another identifier for the user.
   If you have trouble recognizing a user name, you can use this value to label the account.
5  In the Role: list, select the type of role that you want to assign to this user.
6  In the Status: list, select Active.
7  Click Apply.



Managing batch jobs with the Administrator

Introduction

You can specify the time period for which the Administrator displays the status for each batch job. Using the Administrator you can execute, schedule, and monitor batch jobs.

After completing this unit, you will be able to:
• Set the status interval for displaying job executions
• Set the log retention period
• Execute batch jobs
• Schedule jobs
• Monitor jobs

Setting the status interval for job execution

Display intervals for job execution can be set by the last execution time, number of days, or by date.

To set the status interval
1  In the Administrator navigation tree, expand Management and click Status Interval.
2  In the Specify a time period for which job execution status is shown area, click the time period you want to use.
3  Click Apply.
4  In the Status Interval page, click Home.



5  In the Batch Job Status page, click the repository for which you want to display the status.
   The Administrator updates the list of job executions, and the status interval displays in the table title on the Batch Job Status page.

Setting the job log retention period

You can set the log retention period to automatically delete log files after a given number of days.

The logs retain information on:
• Historical batch job error, trace, and monitor logs
• Current service provider trace and error logs
• Current and historical Access Server logs

The Administrator deletes all log files beyond the specified period. For example, if you enter 1, the Administrator displays the logs for today only. After 12:00 AM, these logs clear and the Administrator begins saving logs for the next day. If you enter -1, Data Integrator does not save logs.

To set the log retention period
1  In the Administrator navigation tree, expand Management, and click Log Retention Period.
2  In the Log retention period (in days): field, type the number of days you want to retain logs before the logs are deleted.
3  Click Apply.

Changes you make to the log retention period occur as a background clean-up process so they do not interrupt more important message processing. These changes may also take up to an hour to apply and you may not see logs deleted immediately.



Executing jobs

You can execute batch jobs from the Administrator if their repositories are connected to the Administrator.

To execute a job from the Administrator
1  In the Administrator navigation tree, click Batch and select the repository containing the job you want to execute.
2  In the Batch Job Status page, click the Configuration tab.
3  Under the Action column for the job, click Execute.
4  In the Execute Batch Job page, select the execution and trace options for the job.
5  Click Execute.
   The Administrator returns to the Batch Job Status page.



Scheduling jobs

You can use the Administrator to define a new schedule for a job, view a schedule, activate or deactivate a schedule, and remove a job schedule.

There are two ways to schedule a job in the Data Integrator environment:
• Using the Administrator
• Using a third-party scheduler

Scheduling with the Administrator
Using the Administrator to schedule batch jobs creates an entry in the scheduling utility on the Job Server computer. On a Windows platform, the task scheduling utility is the AT.exe file.
Note: The Administrator does not reflect any changes made to the schedule directly through this utility.

Changes made to the Job Server, such as an upgrade, also do not affect schedules created in Data Integrator as long as:
• The new version of Data Integrator is installed in the same directory as the original version (Data Integrator schedulers use a hard-coded path to the Job Server).
• The new installation uses the Job Server name and port from the previous installation.

When you schedule a job with the Administrator you must:
• Specify a schedule name.
• Determine if the scheduling job is going to be active.
  An active scheduling job allows you to create several schedules for a job and then activate the one you currently want to run.
• Determine if you want the job to be on a recurring schedule.
• Specify whether you want the job to run on a day of the week or on a given date in the month.
• Specify the start time for the job.
• Determine if you want to enable recovery for the job.
• Select to recover from last failed execution if an execution of this job has failed and you want to enable the Recovery mode.
• Select the system profile to use when executing this schedule.
• Select the Job Server or a server group to execute this schedule.



To define a new schedule for a job in the Administrator
1  In the Administrator navigation tree, click Batch and select the repository containing the job you want to execute.
2  In the Batch Job Status page, click the Configuration tab.
3  Under the Action column for the job, click Add Schedule.
   The Schedule Batch Job page displays different options that you can use when scheduling a job execution.
4  In the Schedule Batch Job page, define the options for scheduling the job.
   Note: You must activate the schedule for the job by selecting the Activate option in the Schedule Batch Job page.
5  Click Apply.


To view a schedule
1  In the Administrator navigation tree, click Batch and select the repository containing the job you want to execute.
2  In the Batch Job Status page, click the Configuration tab.
3  Under the Other Information column for the job, click Schedules.
   The Batch Job Schedules page lists scheduled jobs and provides a link to each schedule.
4  Under the Schedule column, click the job you want to view.
   Tip: You can also view the pending job in your computer’s control panel under Scheduled Tasks.
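If you prefer the command line, you can also list the AT entries directly on the Job Server computer. This is only a quick sketch, not a Data Integrator feature: it assumes the schedule was created through AT.exe as described above, and the exact output format depends on your Windows version.

rem List the AT entries on the Job Server computer from a command prompt;
rem a schedule created by the Administrator appears here with its ID and next run time.
at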

To remove, activate, or deactivate a schedule
1  In the Administrator navigation tree, click Batch and select the repository containing the job you want to execute.
2  In the Batch Job Status page, click the Configuration tab.
3  Under the Other Information column for the job, click Schedules.
4  Select the job with which you want to work.
5  Do one of the following:
   • Click Remove to remove the schedule.
   • Click Activate to activate the schedule.
   • Click Deactivate to deactivate the schedule.

Scheduling with a third-party tool
Using a third-party tool to schedule a job initiates the job outside of Data Integrator. On Windows, the job runs from an executable batch file that is exported from Data Integrator.
Note: When a third-party scheduler calls a job, the corresponding Job Server containing the job must be running.

Data Integrator exports job execution command files as batch files on Windows. These files pass parameters and call the Data Integrator Job Launcher, AL_RWJobLauncher. The AL_RWJobLauncher then executes the job and sends it to the appropriate Data Integrator Job Server and waits for the job to complete.

These batch files are stored in your <Data Integrator\Log> directory. For example, assuming a default installation, you can browse to: C:\Program Files\Business Objects\Data Integrator 11.5\Log.
Note: Under normal circumstances, you should not modify the exported file; do so only under the direction of Business Objects Customer Support.

The following shows a sample Windows NT batch file created when Data Integrator exports a job. ROBOT is the host name of the Job Server computer.


All lines after inet:ROBOT:3513 are AL_Engine arguments, not AL_RWJobLauncher arguments.

The following table lists the job launcher flags and their values:

-w   The job launcher starts the job(s) and then waits before passing back the job status. If -w is not specified, the launcher exits immediately after starting a job.

-t   The time, in milliseconds, that the Job Server waits before checking a job’s status. This is a companion argument for -w.

-s   The status or return code. 0 indicates successful completion, non-zero indicates an error condition.
     Combine -w, -t, and -s to execute the job, wait for completion, and return the status.

-C   The name of the engine command file; the path to a file that contains the command-line arguments to be sent to the engine.

-v   Prints the AL_RWJobLauncher version number.

-S   Lists the server group and the Job Servers it contains, using the following syntax:
     "SvrGroupName;JobSvr1Name:JobSvr1Host:JobSvr1Port;JobSvr2Name:JobSvr2Host:JobSvr2Port;..."
     For example:
     "SG_DEV;JS1:HPSVR1:3500;JS2:WINSVR4:3505"

There are two arguments that do not use flags:
• inet address: the host name and port number of the Job Server. The string must be in quotes, for example: "inet:HPSVR1:3500"
  If you use a server group, inet addresses are automatically rewritten using the -S flag arguments. On execution, the first Job Server in the group checks with the others and the Job Server with the lightest load executes the job.


• Server log path: the fully qualified path to the location of the log files. The server log path must be in quotes. The server log path argument does not appear on an exported batch job launch command file. The server log path appears only when Data Integrator generates a file for an active job schedule and stores it in the following directory: LINK_DIR/Log/JobServerName/RepositoryName/JobInstanceName.
  Note: LINK_DIR is the installation directory, for example, c:\DataIntegrator. You cannot manually edit server log paths.
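To see how these pieces fit together, the line below is a minimal sketch of a launcher call such as an exported batch file might contain. It is an illustration only: the path to AL_RWJobLauncher is an assumption based on a default installation, the command-file name Case_Job_cmd.txt is made up for the example, and the file that Data Integrator actually generates may differ in layout and arguments.

rem Illustrative sketch only -- not the file that Data Integrator generates.
rem Start the job on Job Server ROBOT (port 3513), wait for completion, and pass back the status.
"C:\Program Files\Business Objects\Data Integrator 11.5\bin\AL_RWJobLauncher.exe" -w -t 5000 -s -C "C:\Program Files\Business Objects\Data Integrator 11.5\Log\Case_Job_cmd.txt" "inet:ROBOT:3513"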

The job launcher also provides error codes to help you debug potential problems. The error messages are:

Error number   Error
180002         Network failure.
180003         The service that will run the schedule has not started.
180004         LINK_DIR is not defined.
180005         The trace message file could not be created.
180006         The error message file could not be created.
180007         The GUID could not be found. The status cannot be returned.
180008         No command line arguments were found.
180009         Invalid command line syntax.
180010         Cannot open the command file.

When you schedule a job with a third-party tool you must:
• Specify a file name for the batch file containing the job. The Administrator automatically appends the .bat extension to the file name specified.
• Select a system profile to use when executing this job.
• To schedule a Data Integrator job with a third-party scheduler, you must first export the job into a .bat file and then execute the .bat file.
• Select the Job Server or server group to execute this job.
• Decide if you want to enable the automatic recovery feature. When this feature is enabled, Data Integrator saves the results from completed steps and allows you to resume failed jobs.
• Decide if you want to enable recovery from last failed execution for the job. When recovery is enabled, Data Integrator retrieves the results from any steps that were previously executed successfully and executes again any required steps.
  Note: This option is a run-time property and is not available when a job has not yet been executed or when recovery mode was disabled during the previous run.


To schedule a Data Integrator job to a third party tool, you must first export the job to a batch file from the Administrator and then execute the batch file in the third party scheduler.

To export a job for scheduling in a third-party tool
1  In the Administrator, click Batch and select the repository that contains the job you want to schedule.
2  Click the Configuration tab.
   The Batch Job Configuration page displays, listing all the jobs in the connected repositories.
3  Under the Action column, click Export Execution Command for the job you want to export for scheduling.
   The Export Execution Command page displays the export options.
4  Enter the parameters for the batch job command file you want the Administrator to create.
5  Click Export.
   The Administrator creates a batch (.bat) file for the job, writes this file to the local Data Integrator log directory, and returns to the Batch Job Configuration page. The default installation path for this directory is C:\Program Files\Business Objects\Data Integrator 11.5\Log.
6  Schedule the batch file to execute from the third-party software.
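For example, the exported file could be registered with the Windows schtasks utility. This is shown only as an illustration, not a Data Integrator feature: the task name is made up, and the path assumes the default installation directory and an export named Case_Job.bat.

rem Hypothetical third-party schedule: run the exported batch file every day at 02:00.
schtasks /create /tn "DI_Case_Job_Daily" /sc daily /st 02:00 /tr "\"C:\Program Files\Business Objects\Data Integrator 11.5\Log\Case_Job.bat\""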


Monitoring jobs

Using the Administrator, you can monitor the overall status and statistics for job execution of any batch job in a connected repository.

The Batch node in the Administrator displays all the repository connections you have configured with the Administrator. For each repository listed, a status and configuration tab allows you to open pages to work with jobs in the selected repository.

The Status tab displays the status of executed jobs and job logs filtered by the status interval you specified. You can access Trace, Monitor, and Error information for each job under the Job information column on the Status tab.



Activity: Scheduling and executing jobs in the Web Administrator

Objective
In this activity you will work with log files and schedule a job to execute automatically.

Instructions
1  Launch the Data Integrator Administrator.
2  Log in using admin for your user name and password.
3  Create another user with the following properties:
   • User name: your first name
   • Password: your first name
   • Role: monitor
4  Log out of the Administrator and log in again as your new user.
   Note: You cannot see the Management node when you log in as your new user because you only gave your new user a monitor role.
5  Log out of the Administrator and log back in as the admin user.
6  Add the direpo repository to the Administrator. Test the connection.
7  Expand the Batch node in the Administrator. You should see the direpo repository you just added under the Batch node.
8  Click direpo and you should see all the jobs for the direpo repository listed.
9  In the Status tab, click to view the log files for previously executed jobs. View the trace, monitor, and error logs for the job you just selected.
10 In the Configuration tab, click Execute for the Case_Job and enable recovery in the job execution options.
11 View the Trace logs for the Case_Job you just executed.
12 View the Monitor logs for the Case_Job you just executed.
13 View the Error logs for the Case_Job you just executed.
14 Add a schedule to Case_Job to execute two minutes from the current time. Set this schedule to run on a weekly basis.
   Note: Make sure you select the Activate option when adding the schedule for the job. You can make sure the scheduler is enabled by checking that the Task Scheduler service on your computer is running.
15 Check the scheduled tasks on your computer by going to Start > Programs > Accessories > System Tools > Scheduled Tasks. If the Task Scheduler service is not enabled, right-click it and restart the service.


16 Verify that the Scheduled Tasks window is set to View > Details. Notice the schedule file name (At1), the schedule to run, and the next run time for the schedule.


Activity: Scheduling and executing jobs in a third-party tool

Objective
In this activity you will export a job for scheduling in a third-party tool.

Instructions
1  Launch the Administrator and navigate to Case_Job.
2  Export Case_Job to a batch file.
3  Browse to the Data Integrator log directory and find the Case_Job.bat file.
4  Right-click Case_Job.bat and select Edit to view the job launcher flags and values created.
5  Manually execute the batch file by double-clicking the .bat file. You should see a command screen flashing when the job executes.
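If you want to check the launcher's return status from a command prompt instead of double-clicking the file, the lines below are a sketch only: they assume the exported batch file passes the -s return code back to the caller (verify this in your own exported file) and they use the default installation path.

rem Hypothetical check: run the exported file and report a non-zero launcher status.
call "C:\Program Files\Business Objects\Data Integrator 11.5\Log\Case_Job.bat"
if errorlevel 1 (
    echo Case_Job returned launcher status %ERRORLEVEL%
) else (
    echo Case_Job completed with status 0.
)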


Understanding server group basics

Introduction

You can use the Administrator to create and maintain server groups.

After completing this unit, you will be able to:
• Understand architecture, load balance index, and job execution in server groups
• Work with server groups and Designer options
• Add a server group
• Edit and remove server groups
• View Job Server status in server groups

Architecture, load balance index, and job execution

Architecture
There are two rules for creating server groups:
• All the Job Servers in an individual server group must be associated with the same repository.
• Each computer can only contribute one Job Server to a server group.

Associating all the Job Servers in a server group with the same repository allows you to more easily track the jobs that are associated with a server group.

It is recommended that you use a naming convention for server groups that includes the name of the repository. For example, for a repository called DEV, a server group might be called SG_DEV.

During startup, the Job Servers check the repository to find out if they must start as part of a server group.


[Diagram: a server group (SG_DEV) with Job Servers distributed across Computer 1, Computer 2, and Computer 3]


Compared to normal Job Servers, Job Servers in a server group can each:
• Collect a list of other Job Servers in their server group
• Collect system load statistics every 60 seconds. This includes the number of CPUs (on startup only), average CPU load, and available virtual memory
• Service requests for system load statistics
• Accept server group execution requests

Load balance index

All Job Servers in a server group collect and consolidate system load statistics and convert them into a load balance index value for each Job Server. A Job Server’s load balance index value allows Data Integrator to normalize statistics taken from different platforms. The Job Server with the lowest index value is selected to execute the current job. Data Integrator polls all Job Server computers every 60 seconds to refresh the load balance index.

Job execution
You can execute a batch job within a server group from the Designer or Administrator interface.

When you execute a job using a server group, the server group executes the job on the Job Server in the group that is running on the computer with the lightest load. The Administrator will also resynchronize a Job Server with its repository if there are changes made to the server group configuration settings.

Working with server group and Designer options

Some Designer options, such as the ones listed below, assume that their paths are relative to a Job Server:
• Source and target directories for files
• Bulk load directories
• Source and target connection strings to databases
• Path to repositories

If your Job Server resides on a machine different from where the Designer has been installed, you must ensure that connections and directory paths point to the Job Server host that will run the job.

When using server groups, consider the additional layer of complexity for connections. For example, if you have three Job Servers in a server group:
• Use the same directory structure across your three host computers for source and target file operations and use relative paths for file names.
• Use the same connection strings to your databases for all three Job Server hosts.

Tip: Thoroughly test Job Server-centric job options when working with server groups.



Adding a server group

You create and add server groups using the Server Group node in the Administrator tree.

Keep in mind that:
• The Job Servers registered with a given repository are displayed. You can create one server group per repository.
• One Job Server on a computer can be added to a server group.

Tip: Use the Host and Port column to verify that the Job Servers you select are each installed on a different host.

To add a server group
1  In the Administrator navigation tree, expand Server Groups, and click All Server Groups.
2  In the Server Group Status page, click the Configuration tab.
3  Click Add.
4  In the Repository: drop-down list, select a repository.
5  In the Server Group Name: field, type a server group name.
   Tip: The default server group name is the prefix SG and the repository name. Use the prefix SG in your server group name to help you distinguish the group as a server group.
6  Under the Job Server Name column, select the Job Servers you want to add to your group.
7  Click Apply.



Editing and removing server groups

You can select a new set of Job Servers for an existing server group or remove a server group.

Trace messages are written when you create, edit, or make changes in a Job Server status. For example:
• When a Job Server is upgraded to membership in a server group, the trace message is:
  Collecting system load statistics, maintaining list of Job Server(s) for this server group, and accepting Job Server execution requests.
• When a Job Server is downgraded out of a server group, the trace message is:
  Deleting current system load statistics, and not collecting more. Not accepting job execution requests from a server group.

To edit a server group
1  In the Administrator navigation tree, expand Server Groups, and click All Server Groups.
2  Click the Configuration tab.
3  In the Server Group Configuration page, click the server group that you want to edit.
4  In the Edit Server Group page, select a new set of Job Servers.
5  Click Apply.

Your edited server group is saved.

To remove a server group
1  In the Administrator navigation tree, expand Server Groups and click All Server Groups.
2  In the Server Group Status page, click the Configuration tab.
3  In the Server Group Configuration page, select the check box of the server group(s) that you want to remove.
4  Click Remove.
   The selected server group is removed from the display.
   Note: The Administrator displays an invalid status for the server group if you delete all the Job Servers in the server group from a repository.



Viewing Job Server status in server groups

You can view the status of the Job Servers in a server group in the Administrator.

The status of a Job Server is indicated by different colors:
• A green indicator signifies that a Job Server is running.
• A yellow indicator signifies that a Job Server is running but not responding.
• A red indicator signifies that the Job Server is not running.

If a server group contains Job Servers with a mix of green, yellow, or red indicators, then its indicator appears yellow. Otherwise, a server group indicator displays the same color indicator as the Job Servers in the Job Server group.

To view Job Server status
•  In the Administrator navigation tree, expand Server Groups, and click All Server Groups.

The Server Group Status page opens. All existing server groups are displayed with the Job Servers they contain.



Understanding central repository security basics

Introduction

Data Integrator provides options for managing secure access and tracking for objects in central repositories.

After completing this unit, you will be able to:
• Explain central repository security
• Create a secure central repository
• Define a connection to a secure central repository
• Implement group permissions
• View and modify permissions

Explaining central repository security

The mechanisms for managing this security include:
• Authentication: allows only valid users to log in to a central repository.
• Authorization: grants various levels of permissions to objects.
• Auditing: maintains a history of changes made to an object, including user names.

Note: These security procedures only apply to central repositories.

You can create a secured central repository or upgrade a nonsecure central repository to a secured central repository.

When you implement central repository security you must:
1  Create a secure central repository or upgrade an existing nonsecured central repository to a secured one.
2  Add and configure groups and users with the Data Integrator Administrator.
3  Define a connection to the secure central repository.
4  View and modify permissions in the Central object library.

Introduction

Explaining central repository security

Page 440: Data Integrator DM370R2 Learner Guide GA

To create a secure central repository
1 In the Data Integrator Repository Manager, create a new central repository, and select the Enable Security check box.
Data Integrator creates repository tables in the database you identified and creates a security key file, databaseserver_database_user.key.

To upgrade a nonsecured central repository to a secured central repository
1 In the Data Integrator Repository Manager, under Repository Type, click Central.
2 Enter the database connection information for the central repository to modify.
3 Select the Enable security check box.
4 Click Upgrade.
Data Integrator updates the repository tables in the database you identified and creates a security key file named databaseserver_database_user.key.
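The key file name is simply composed from the connection components named in that convention; a tiny sketch with invented example values:

# Example values only; substitute your own database server, database, and user.
database_server = "DBSERVER01"
database = "CentralRepo"
user = "di_admin"
key_file = f"{database_server}_{database}_{user}.key"
print(key_file)   # DBSERVER01_CentralRepo_di_admin.key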

A user that is part of a group with permissions or an administrator can define a connection to a secured central repository.

To define a connection to a secure central repository
1 Start the Data Integrator Designer and log into your local repository.
2 From the Tools menu, click Central Repositories to open the Options window.
The Central Repository Connections option should be selected in the Designer list.
3 Click Add.
The Central Repository Editor opens.
4 In the Repository Name field, enter a name to identify the connection to this central repository.
Note: This name is only visible in the Options window in the Central Repository Connections list.

Creating a secure central repository

Defining connections to a secure central repository

Page 441: Data Integrator DM370R2 Learner Guide GA


5 Select the Secure check box to enable the Repository User Information fields and the Security key options.

6 In the Database Connection Information area, click Read Security Key to import the database connection information values from the key that Data Integrator generated when you created the secure central repository using the Repository Manager.
Note: You can also type the appropriate database and user information manually, and click Generate Security Key to create databaseserver_database_user.key and save this information for future use.
7 In the Repository User Information area, type the user name and password as defined in the Administrator.
Tip: Select the Remember check box to store the information for the next time you log in.
8 Click OK.

The list of central repository connections now includes the newly connected central repository and it is identified as being secure.

From the Central repository connections area, select the secure central repository, and click Activate.

Implement security for a central repository by establishing a structure of groups and associated users using the Administrator.

Access permissions for objects apply at the group level. More than one group can have the same permissions to the same object simultaneously. Groups are specific to a repository and are not visible in any other local or central repository.

Users select from the group(s) to which they belong in the Designer, and the selected (current) group dictates their access to that object. Each user must have one default group but can belong to more than one group. When a user adds an object to a secure central repository, the user's current group automatically has full permissions to that object.
Note: User name and password authentication is required every time users log into a secure central repository. Users can change their passwords at any time from the Central Repository Editor in the Designer.

Permission levels
Each object in a secure central repository can have one of these permission levels:
• Full
This is the highest level of permission. The group can perform all possible actions, including checking in, checking out, and deleting the object. You might assign this type of access to developers, for example.

Implementing group permissions

Page 442: Data Integrator DM370R2 Learner Guide GA

• Read
Users can only get a copy of the object from the central repository or compare objects between their Local and Central object libraries. For example, you might assign this type of access to QA.
• None
Users cannot get copies of the object but can view it and its properties.

When an authenticated user adds an object to a secure central repository, the user’s current group receives Full permissions to the object. All other groups receive Read permissions. Members of the group with Full permissions can change the other groups’ permissions for that object.
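Conceptually, the default assignment behaves like the following sketch (an illustration only; the function and dictionary layout are hypothetical, not Data Integrator internals):

def assign_default_permissions(all_groups, current_group):
    # The adding user's current group gets Full; every other group gets Read.
    return {group: ("Full" if group == current_group else "Read")
            for group in all_groups}

perms = assign_default_permissions(["Developers", "QA", "Support"], "Developers")
print(perms)   # {'Developers': 'Full', 'QA': 'Read', 'Support': 'Read'}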

When you create a secure central repository, the repository name appears under the Central Repositories node. Links under this node include:
• Users and groups
Use to add, remove, and configure users and groups for secure object access.
• Reports
Use to generate reports for central repository objects, such as viewing the change history of an object.

To add groups to a central repository
Groups are specific to a repository and are not visible in any other local or central repository.
1 In the Administrator navigation tree, expand Central Repositories.
2 Expand the repository to configure, and click Users and Groups.
The Groups and Users page displays.
3 On the Groups tab, click Add.
4 Type a name for the group.
5 Type a description for the group.
6 Click Apply.
The group appears on the Groups tab.

To add users to a central repository
Note: When a user adds an object to a secure central repository, the user's current group automatically has full permissions to that object.
1 In the Administrator navigation tree, expand Central Repositories.
2 Expand the repository to configure, and click Users and Groups.
3 Click the Users tab.
4 Click Add.

Page 443: Data Integrator DM370R2 Learner Guide GA

5 On the Add/Edit User page, enter this information:
• Username
Note: You can type a new user name since user names and passwords in the Administrator do not need to match those for your system or repository.
• Password
• Display name
Tip: Enter another identifier for the user, such as the full name. If you have difficulty recognizing a user name, you can use this value to label the account.
• Default group: the default group to which the user belongs. You can change the default by selecting another from the drop-down list.
• Status:
• Active: enables the user's account for normal activities.
• Suspended: select to disable the login for that user.
• Description
Note: The User is a member of list on the left shows the groups to which this user belongs.
6 Click Apply.

Use the Central object library to view and modify group permissions for objects that you have added to the secure central repository.

To view permissions for an object
1 In the Data Integrator Designer, log into your local repository.
2 On the Designer toolbar, click Tools > Central Repositories.
3 Select the secure central repository and click Activate.
4 On the Designer toolbar, click the Central object library icon to display all the objects available in the central repository.
Your default group appears in the drop-down list at the top of the window and is marked with an asterisk. The Permissions column displays the current group's access level for each object.

Viewing and modifying permissions

Page 444: Data Integrator DM370R2 Learner Guide GA

If you add a new object to the Central object library, the current group gets Full permissions and all other groups get Read permission.

To modify object permissions for other groups
Note: You must have Full permissions to change object access for other groups.
1 In the Central object library, right-click the object and click Permission > Object or Permission > Object and dependants.
The Permission window displays a list of available groups and each group's access level to the object(s).
2 Click in the Permission column.
3 From the drop-down list, select a permission level for the group.
4 Click Apply.
Note: You cannot modify permissions with filtering in this version.

Page 445: Data Integrator DM370R2 Learner Guide GA


Lesson Summary

Quiz: Using the Administrator
1 What are the functions of the Designer and the Administrator?

2 Which type of user role provides access to Status pages (to monitor, start, and stop processes) and to the Status Interval and Log Retention Period pages?

3 What are the two rules for creating server groups?

After completing this lesson, you are now able to:
• Log into the Administrator
• Describe the Administrator interface
• View available repositories
• Add a repository
• Add user roles
• Set the status interval for displaying job executions
• Set the log retention period
• Execute batch jobs
• Schedule jobs
• Monitor jobs
• Understand architecture, load balance index, and job execution in server groups
• Work with server groups and Designer options
• Add a server group
• Edit and remove server groups
• View Job Server status in server groups
• Explain central repository security

Review

Summary

Page 446: Data Integrator DM370R2 Learner Guide GA

• Create a secure central repository
• Define a connection to a secure central repository
• Implement group permissions
• View and modify permissions

Page 447: Data Integrator DM370R2 Learner Guide GA

Lesson 14
Profiling Data

Using the Data Profiler allows you to see the actual data so that you can evaluate it before designing your data transformation jobs.

In this lesson you will learn about:
• Using the Data Profiler

Duration: 1 hour

Page 448: Data Integrator DM370R2 Learner Guide GA


Using the Data Profiler

Data profiling is a feature that you can use to understand your source system. The Data Profiler is not a separate component outside of Data Integrator, but you must perform several definition and configuration steps before you can use the Data Profiler. First, you create a Data Profiler repository. Once created, you can then add Profiler users and configure Data Profiler tasks. The final step is to connect the Designer to the Data Profiler server.

After completing this unit, you will be able to:
• Explain what data profiling is
• Set up a Data Profiler repository
• Add Data Profiler users
• Configure Data Profiler tasks
• Connect the Data Profiler Server to the Designer
• Submit a profiling task and view generated profile information
• Monitor profiling tasks in the Administrator

Data Profiling is important because it allows you to understand source systems better. It gives you complete visibility into the quality and content of various data sources.

Trusted data can be achieved through effective data profiling. You can uncover data anomalies by inspecting the true content, distribution, structure and relationship within enterprise data sources.

You can verify that the metadata information provided to you is indeed valid before you start the design of an ETL project. You can also discover the quality of your data. For example, you can look at the number of nulls and distinct values in a specified column.

The Data Profiler allows you to obtain information that you can use to determine:
• The quality of your source data before you extract.
The Data Profiler can identify anomalies in your source data to help you better define corrective actions in the validation, data cleansing, or other transformations.

• The distribution, relationship, and structure of your source data to better design your Data Integrator jobs and data flows, as well as your target data warehouse

• The content of your source and target data so that you can verify that your data extraction job returns the results you expect

Introduction

Explaining what data profiling is

Page 449: Data Integrator DM370R2 Learner Guide GA

Use the Data Profiler to generate profiling tasks and to collect information that multiple users can view, such as:
• Column analysis
This information includes minimum value, maximum value, average value, minimum string length, and maximum string length. You can also generate detailed column analysis such as distinct count, distinct percent, median, median string length, pattern count, and pattern percent.
• Relationship analysis
This information identifies data mismatches between any two columns for which you define a relationship, including columns that have an existing primary key and foreign key relationship.

You can use the information generated from profiling tasks to assist you in many different tasks, such as:
• Obtaining information on basic statistics, frequencies, ranges, and outliers
• Identifying variations of the same data content
• Discovering and validating data patterns and formats
• Analyzing a numeric range
• Identifying and validating redundant data and relationships across data sources
• Identifying duplicate name and address and non-name and address information
• Identifying missing data, nulls, and blanks in the source system

To enable profiling within Data Integrator, you have to create a Data Profiler repository, associate it with the same Job Server as your Designer repository or with a different one, and add it to the Administrator. If you intend to have different users share the same profiling results, consider dedicating a Job Server to the Data Profiler repository. The next section discusses how to accomplish these tasks.

You create the Data Profiler repository in the Data Integrator Repository Manager. This repository can be created and connections can be set up during the Data Integrator installation process or separately.

For the purpose of this course, you will create and define the connections to the Data Profiler repository separately.

Setting up a Data Profiler repository

Page 450: Data Integrator DM370R2 Learner Guide GA

To create a Data Profiler repository
1 In Microsoft SQL Enterprise Manager, create a new database that you are going to use for your Data Profiler repository (a scripted alternative to this step follows the procedure).
2 From the Start menu, point to Programs > BusinessObjects Data Integrator 11.5, and select Repository Manager.
Note: This assumes a default Data Integrator installation path.
3 In the Data Integrator Repository Manager, complete these fields:
• Database type: from the drop-down list, select the database type
• Database server name: enter your machine's name
• Database name: enter the name for the database you created earlier
• User name: enter the assigned user name
• Password: enter the assigned password
4 Under Repository type, select Profiler, and click Create.
A message detailing successful creation of the Data Profiler repository appears. You also see a message prompting you to associate the Data Profiler repository with a Job Server and the Administrator.
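If you prefer to script the database creation in step 1 rather than use Enterprise Manager, a minimal sketch using pyodbc is shown below. The driver string, server, credentials, and database name are placeholders, not values from this course setup:

import pyodbc

# Placeholder connection string: adjust the driver, server, and credentials.
conn = pyodbc.connect(
    "DRIVER={SQL Server};SERVER=localhost;UID=sa;PWD=sa",
    autocommit=True,   # CREATE DATABASE cannot run inside a transaction
)
conn.cursor().execute("CREATE DATABASE DIProfilerRepo")  # hypothetical name
conn.close()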

Page 451: Data Integrator DM370R2 Learner Guide GA

To associate the Data Profiler repository to a Job Server
1 From the Start menu, point to Programs > BusinessObjects Data Integrator 11.5 > Server Manager.
2 In the Data Integrator Server Manager, click Edit Job Server Config.
3 In the Job Server Configuration Editor, click to select the existing Job Server.
The Edit, Delete, and Resync with Repository options are now enabled.
4 Click Edit.
Note: You can assign the same Job Server as the Designer repository or give it a different Job Server by clicking Add.
5 In the Job Server Properties window, under Associated Repositories, click Add.
6 Under Repository Information, enter the information required for the profiling repository you created.
7 Click Apply.
8 Click OK twice to exit the Job Server Properties and Configuration windows.
9 Click Restart, and then click OK to exit the Server Manager.

Page 452: Data Integrator DM370R2 Learner Guide GA

To add the Data Profiler repository to the Administrator
1 From the Start menu, point to Programs > BusinessObjects Data Integrator 11.5, and select Data Integrator Web Administrator.
2 In the Administrator Login page, enter admin as User name and password, and click Log in.
3 In the navigation tree on the left panel, expand Management, and click Repositories.
4 In the List of Repositories page, click Add.
5 In the Add/Edit Repository page, enter the corresponding information for your Data Profiler repository:
• Repository name: enter a logical name for a repository
• Database type: from the drop-down list, select the database type
• Machine name: enter your machine's name
• Database port: port number of the database. Leave the default value.
• Database name: enter the name for the database you created earlier
• User name: enter the assigned user name
• Password: enter the assigned password
6 Click Test, and then click Apply.
The Data Profiler repository is now added to the list of repositories available in the Administrator.

Note: When you connect a profiler repository, the repository name appears under the Profiler Repositories node in the Administrator navigation tree.

Page 453: Data Integrator DM370R2 Learner Guide GA

You can use the default admin user name and password to connect to the Data Profiler server, or you can add additional profiler users. There are two types of users that you can add to Data Profiler:
• Profiler user: this user is authorized to manage profiler tasks within a specified profiler repository.
• Profiler administrator: this user can manage tasks in the default profiler repository and can also manage tasks in any profiler repository.
Note: The Administrator role shares the same profiler rights as the Profiler administrator.

To add Data Profiler users
1 In the Administrator, expand Management, and click Users.
2 In the User Management page, click Add.
3 In the Add/Edit user page, specify the user's:
• User name
• Password
• Display name: to avoid possible confusion recognizing a user's name, you can use the display name as another identifier for the user, for example, the user's full name.
• Role: select Profiler User or Profiler Administrator
• Status: keep the default active status value
• Profiler repository: specify the profiler repository for this user

4 Click Apply.

Adding Data Profiler users

Page 454: Data Integrator DM370R2 Learner Guide GA


You can set task execution and task management configuration parameters to control the amount of resources that profiling tasks use to calculate and generate Data Profiler statistics.

Configuring Data Profiler tasks

Task Execution parameters
Task Execution parameters are separated into reading data, saving data, and performance parameters, as outlined below:
• Profiling size (Reading data). Default: All. Number of maximum rows to profile. You might want to specify a maximum number of rows to profile if the tables you profile are very large and you want to reduce memory consumption.
• Sampling rows (Reading data). Default: 1. Profile the first row of the specified number of sampling rows. For example, if you set Profiling size to 1000000 and set Sampling rows to 100, the Profiler profiles rows number 1, 101, 201, and so forth until 1000000 rows are profiled. Sampling rows throughout the table can give you a more accurate representation rather than profiling just the first 1000000 rows (a sketch of this sampling pattern follows these parameter lists).
• Number of distinct values (Saving data). Default: 100. Number of distinct values to save in the profiler repository.
• Number of patterns (Saving data). Default: 100. Number of patterns to save in the profiler repository.
• Number of days to keep results (Saving data). Default: 90. Number of days to keep profiler results in the profiler repository.
• Number of records to save (Saving data). Default: 100. Number of records to save for each profile attribute.
• Rows per commit (Saving data). Default: 5000. Number of records to save before a commit is issued.
• Degree of parallelism (Performance). Default: 2. Number of parallel processing threads.
• File Processing Threads (Performance). Default: 2. Number of file processing threads for file sources.

Task Management parameters
Task Management parameters are separated into normal and advanced categories, as outlined below:
• Maximum concurrent tasks. Default: 10. Maximum number of profiling tasks to run simultaneously.
• Refresh interval (days). Default: 0. Number of days that must elapse before a profiler task is rerun for the same table or key columns when the user submits a profiler task. The default of 0 means that there is no limit to the number of Data Profiler tasks that can be run per day. After a profiling task is run, you can use the Update option when viewing the existing task results to override this refresh interval.
• Invoke sleep interval (seconds). Default: 5. Number of seconds to sleep before the Data Profiler checks for completion of an invoked task. Invoked tasks run synchronously, and the Data Profiler must check for their completion.
• Submit sleep interval (seconds). Default: 10. Number of seconds to sleep before the Data Profiler attempts to start pending tasks. Pending tasks have not yet started because the maximum number of concurrent tasks is reached.
• Inactive interval (minutes). Default: 1. Number of minutes a profiling task can be inactive before the Data Profiler cancels it.

To configure Data Profiler tasks
1 In the Administrator, expand Management, and click Profiler Configuration.
2 Keep or change the parameter values listed above.
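To make the Profiling size and Sampling rows interaction concrete, here is a small sketch of the sampling pattern under one reading of the example given for Sampling rows; it is illustrative only, not the Profiler's implementation:

def sampled_row_numbers(profiling_size, sampling_rows):
    # Take the first row of each block of `sampling_rows` rows until
    # `profiling_size` rows have been profiled (one reading of the example).
    return [1 + i * sampling_rows for i in range(profiling_size)]

rows = sampled_row_numbers(1_000_000, 100)
print(rows[:4])   # [1, 101, 201, 301]
print(len(rows))  # 1000000 rows profiled, spread throughout the table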

Page 456: Data Integrator DM370R2 Learner Guide GA


The Designer must connect to the Data Profiler server to run the Data Profiler and view the profiling results.

To connect the Data Profiler server to the Designer
1 From the Designer toolbar, click Tools, and select Profile Server Login.
2 In the Profiler Server Login page, enter:
• Host: the name of the computer where the Data Profiler Server exists
• Port: the port number through which the Designer connects to the Data Profiler Server. You can leave the default value in this field.
• User name: name of the user associated with the Data Profiler Server
• Password: password of the user associated with the Data Profiler Server
3 Click Test to validate the profiler server location.
4 Click Connect to connect to the profiler repository.

Note: When you successfully connect to the Profiler server, the Profiler server icon located to the right of the Job Server icon no longer has a red X on it. In addition, when you move the cursor over this icon, the status bar displays the location of the Profiler server.

Connecting the Designer to the Data Profiler server

Page 457: Data Integrator DM370R2 Learner Guide GA


Activity: Setting up a Data Profiler repository

Objective
In this activity you will:
• Create a Data Profiler repository
• Add the Data Profiler repository to the Job Server and Administrator
• Add Data Profiler users
• Connect the Designer to the profiler server
Note: For the purpose of this exercise, you will leave the default configuration task parameters.

Creating a Data Profiler repository
1 In Microsoft SQL Enterprise Manager, create a new database called DIDataProfiler. You can use the default user name and password sa for this database.
2 From the Start menu, navigate to the Data Integrator Repository Manager.
3 In the Data Integrator Repository Manager, complete these fields:
• Database type: from the drop-down list, select Microsoft_SQL_Server
• Database server name: localhost or your machine's name
• Database name: DIDataProfiler
• User name: sa
• Password: sa
4 Under Repository type, select Profiler, and click Create.

Adding the Data Profiler repository to the Job Server and Administrator
1 From the Start menu, navigate to the Data Integrator Server Manager.
2 In the Data Integrator Server Manager, click Edit Job Server Config.
3 In the Job Server Configuration Editor, click to highlight the existing Job Server, and click Edit.
Note: For the purpose of this activity, we will associate the profiler repository to the existing Job Server.
4 In the Job Server Properties window, under Associated Repositories, click Add.
5 Under Repository Information, enter the information required for the profiling repository:
• Database type: from the drop-down list, select Microsoft_SQL_Server
• Database server name: localhost or your machine's name
• Database name: DIDataProfiler

Practice

Page 458: Data Integrator DM370R2 Learner Guide GA

• User name: sa
• Password: sa
6 Click Apply.
7 Click OK twice to exit the Job Server Properties and Configuration windows.
8 Click Restart, and then click OK to exit the Server Manager.
9 Return to the Start menu, and navigate to the Data Integrator Administrator.
10 Log in using admin as both your user name and password.
11 Expand Management, and click Repositories.
12 In the List of Repositories page, click Add.
13 In the Add/Edit Repository page, enter the information required for the profiler repository:
• Repository name: DIDataProfiler
• Database type: from the drop-down list, select Microsoft_SQL_Server
• Machine name: localhost or your machine's name
• Database name: DIDataProfiler
• User name: sa
• Password: sa

14 Click Test, and then Apply to complete the process.

Adding Data Profiler users
1 In the Administrator, expand Management, and click Users.
2 In the User Management page, click Add.
3 In the Add/Edit user page, add your name, and complete the required information displayed. Assign yourself the Profiler administrator role.
4 Click Apply.

Connect the Designer to the Data Profiler server
1 From the Designer toolbar, click Tools, and select Profile Server Login.
2 Use the user name that you created in the Administrator to log on.

Page 459: Data Integrator DM370R2 Learner Guide GA


The Data Profiler enables you to generate profiling statistics to derive trusted information from your sources. A profiling task, upon successful completion, saves profiling results in the data profiler repository.

To reduce redundant tasks, you can generate profiling tasks and have different Data Profiler users access the task results in the same Data Profiler repository.
Note: If the same table is profiled repeatedly, the latest results are saved in the Data Profiler repository, and the older data is purged.

You can choose to calculate and generate two types of Data Profiler statistics:
• Column profile, by submitting a column profiler task
• Relationship profile, by submitting a relationship profiler task

Submitting a column profiling task
You can calculate Data Profiler statistics for any set of columns you choose. For example, using the Data Profiler, you might profile only the SUPPLIER ID column to obtain its maximum and average string length so that you can determine the size of the SUPPLIER ID column in your target data warehouse.
Note: You cannot profile columns with nested schemas or with the LONG data type.

The columns you profile can all belong to one data source or to multiple data sources. When you generate statistics on columns, the default statistics are:
• Low value (Min)
• High value (Max)
• Low value count
• High value count
• Average value (Avg)
• Minimum string length
• Maximum string length
• Average string length
• NULL count
• NULL percent
• Zero count
• Zero percent
• Blank count
• Blank percent
You can also generate additional detailed column attributes (a sketch computing a few of these statistics appears after the note that follows):
• Distinct count
• Distinct percent
• Median value
• Median string length
• Pattern
• Pattern count

Submitting a profiling task

Page 460: Data Integrator DM370R2 Learner Guide GA


Note: These additional detailed profiling column attributes require more memory and CPU resources to generate. Therefore, Business Objects recommends that you generate these only when required.
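As a purely conceptual reference, the sketch below computes a handful of the default and detailed statistics listed above for one sample column in plain Python. It illustrates what the statistics mean; it is not how the Data Profiler computes them, and the column values are invented:

from statistics import median

values = ["WA", "OR", None, "", "CA", "WA", None]   # hypothetical Region column

non_null = [v for v in values if v is not None]
stats = {
    "NULL count": values.count(None),
    "NULL percent": 100.0 * values.count(None) / len(values),
    "Blank count": non_null.count(""),
    "Min string length": min(len(v) for v in non_null),
    "Max string length": max(len(v) for v in non_null),
    "Distinct count": len(set(non_null)),
    "Median string length": median(len(v) for v in non_null),
}
print(stats)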

A default name is generated for each profiling task. You can edit the task name. The default name formats are:
• Single source: username_t_sourcename
• Multiple sources: username_t_firstsourcename_lastsourcename
where t is the type of profile (c for column profile; r for relationship profile), firstsourcename represents the first selected source, and lastsourcename represents the last selected source when selecting multiple sources.
Note: Dashes are not allowed in task names.
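A small sketch of how these default task names are composed (an illustrative helper, not part of Data Integrator):

def default_task_name(username, profile_type, sources):
    # profile_type: "c" for a column profile, "r" for a relationship profile
    if len(sources) == 1:
        return f"{username}_{profile_type}_{sources[0]}"
    return f"{username}_{profile_type}_{sources[0]}_{sources[-1]}"

print(default_task_name("di_user", "c", ["CUSTOMERS"]))             # di_user_c_CUSTOMERS
print(default_task_name("di_user", "r", ["ORDERS", "SALESORDER"]))  # di_user_r_ORDERS_SALESORDER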

You can use column profiling to assist you in these tasks:
• Obtaining information on basic statistics, frequencies, ranges, and outliers
For example, these statistics might show that a column value is markedly higher than the other values in a data source. You might decide to define a validation transform to set a flag in a different table when you load this outlier into the target table.
• Identifying variations of the same data content
For example, part number might be an integer data type in one data source and a varchar data type in another data source. You would then decide which data type you want to use in your target data warehouse.
• Discovering and validating data patterns and formats
For example, phone number might have several different formats, and you decide to convert them all to use the same target format (a conceptual sketch of pattern discovery follows this list).
• Analyzing a numeric range
For example, customer number might have one range of numbers in one source, and a different range in another source. Your target will need to have a data type that can accommodate the maximum range.
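The pattern-discovery idea referenced above can be illustrated by collapsing each value to a shape, replacing digits and letters with placeholder characters, and counting the distinct shapes. This is only a conceptual sketch with made-up phone numbers, not Data Integrator's pattern logic or its match_pattern function:

import re

def value_pattern(value):
    # Replace digits with '9' and letters with 'A'; punctuation and spaces stay as-is.
    pattern = re.sub(r"\d", "9", value)
    return re.sub(r"[A-Za-z]", "A", pattern)

phones = ["(503) 555-7555", "503.555.9931", "+1 503 555 3612"]
patterns = {value_pattern(p) for p in phones}
print(len(patterns), patterns)   # 3 distinct phone patterns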

Page 461: Data Integrator DM370R2 Learner Guide GA

To submit a column profile task for a single source
1 In the Datastore tab of the Designer Local object library, expand a datastore, and browse to a table.
2 Right-click the table, and select Submit Column Profile Request.
3 From the Submit Column Profile Request window, click the check mark before the name of any columns that you do not want to profile.
4 From the header row, select Detailed Profiling.
By default, the Detailed Profiling column is unchecked. You can also select specific columns to apply detailed profiling to. To do so, under Detailed Profiling, select each column you want to apply detailed profiling to.
Note: Keep in mind that the detailed profiling feature consumes a lot of resources. Use this option only if you want to profile column distinct count, distinct percent, median value, median string length, pattern, and pattern count.

5 Click Submit to execute the profiling task.
The Data Profiler monitor displays, showing name, type, status, timestamp, and sources information for the task.
6 Click Refresh to verify that the Data Profiler job completed.
Note: You have to manually refresh the Profiler monitor to display whether a job is complete. If clicking Refresh does not change the status, wait one minute and then try clicking Refresh again.
After you click Refresh, the status displays as Done when the profiling task completes successfully.
This monitor displays the most current task and all of the profiler tasks that have executed within a configured number of days.

Page 462: Data Integrator DM370R2 Learner Guide GA

To view the Profiler monitor window when a profile task is not running, click Tools > Profiler monitor on the Menu bar.

To submit a column profile task for multiple sources
1 In the Datastore tab of the Designer Local object library, expand a datastore, and browse to a table.
2 Right-click the table, and select Submit Column Profile Request.
The Submit Column Profile Request window displays.
3 In the Submit Column Profile Request window, under the Sources section, select a table source.
4 In the right pane of the Submit Column Profile Request window, click to clear any columns you do not want to profile.
5 From the right pane, in the header row, select Detailed Profiling.
By default, the Detailed Profiling column is unchecked. You can also select specific columns to apply detailed profiling to. To do so, under Detailed Profiling, select each column you want to apply detailed profiling to. You can apply detailed profiling for all sources by clicking the Detailed Profiling button and selecting Apply to all columns all sources.
Note: Keep in mind that the detailed profiling feature consumes a lot of resources. Use this option only if you want to profile column distinct count, distinct percent, median value, median string length, pattern, and pattern count.
6 Click Submit to execute the profiling task.
7 In the Profiler Monitor, click Refresh to verify that the Data Profiler job completed.
Note: You have to manually refresh the Profiler monitor to display whether a job is complete. If clicking Refresh does not change the status, wait one minute and then try clicking Refresh again.

After you click Refresh, the status displays as Done when the profiling task completes successfully.

Page 463: Data Integrator DM370R2 Learner Guide GA

To view column profile results
1 In the Designer Local object library, browse to the datastore and table on which you submitted the profile task.
2 Right-click the table, and select View Data.
3 In the View data window, click the Profiler tab.
The Profiler tab shows the number of physical records that the Data Profiler processed to generate the values in the grid. Values denoted by n/a indicate that the profile statistic is not applicable to the data type in the column profiled. Greyed out values indicate that the specific profile statistic was bypassed and not generated. The Records displayed value is the total number of sampling rows specified under the Data Profiler configuration settings and does not reflect the total number of rows in the table profiled.
4 Click an attribute value to view the entire row in the source table.
The bottom half of the View Data window displays the rows that contain the attribute value that you clicked.
Tip: You can hide columns that you do not want to view by clicking the Show/Hide Columns icon.
Note: When you click Update, you manually override the Refresh Interval configuration. You use the Refresh Interval to specify the number of times per day a column or relationship profile task can be submitted.

Page 464: Data Integrator DM370R2 Learner Guide GA

Submitting a relationship profiler task
A relationship profile shows the percentage of non-matching values in columns for two sources. For example, you can submit a relationship profiler task for two different CUSTOMER tables to see the number of non-matching values between the REGION ID column in these two tables.

You can use a relationship profile to assist you in these tasks:
• Identify and validate redundant data and relationships across data sources. For example, two different problem tracking systems might include a subset of common customer-reported problems, but some problems only exist in one system or the other.

• Identify duplicate name and address and non-name and address information.

• Identify missing data, nulls and blanks in the source system. For example, one data source might include region, but another source might not.

The sources can be tables, flat files, or a combination of a table and a flat file. The key columns can have a primary key and foreign key relationship defined or they can be unrelated. For example, one key comes from a datastore while the other key is from a file format. Or you profile tables that have no primary or foreign key constraints to find out how many values in a column of one table do not have matching values in a column of another table.

Using the Data Profiler you can also profile the relationship between the data for specified tables deriving from two different source systems.

When you view the relationship profile results, you can drill down to see the actual data that does not match.
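The core calculation behind a relationship profile can be sketched as follows; the column data and function below are invented for illustration and are not the Data Profiler's implementation:

def non_matching_percent(left_values, right_values):
    # Percentage of rows in `left_values` whose key has no match in `right_values`.
    right_keys = set(right_values)
    misses = sum(1 for value in left_values if value not in right_keys)
    return 100.0 * misses / len(left_values)

orders_flatfile = ["10421", "10422", "10423", "10424"]   # hypothetical key values
ods_salesorder = ["10421", "10500"]
print(round(non_matching_percent(orders_flatfile, ods_salesorder), 2))  # 75.0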

To submit a relationship profiler task
1 In the Designer Local object library, browse to the datastore or source, and expand it to display the two tables or sources for which you want to submit the profile task.
2 Right-click the first table or source, select Submit Relationship Profile Request, and click Relationship With.
Note that your cursor changes to indicate that you need to select a second table for this profiling task.
3 Browse to the second table or source you want for this profiling task, and click the table to select it.
Note: You can browse to tables in the same or a different datastore, or to a flat or XML file.

Page 465: Data Integrator DM370R2 Learner Guide GA


The Submit Relationship Profile Request window displays:

The Submit Relationship Profile Request window shows a line between the primary key column and foreign key column of the two sources, if the relationship exists. The bottom half of the Submit Relationship Profile Request window shows that the profile task will use the equal (=) operator to compare the two columns. The Data Profiler will determine which values are not equal and calculate the percentage of non-matching values.
Note: By default, the Data Profiler saves the data only in the columns that you select for the relationship. If you want to see values in the other columns, select Save all columns data.

4 Click Submit to execute the profiling task.
Note: If a primary key and foreign key relationship does not exist between the two data sources, specify the columns that you want to profile. To specify a relationship between two columns, move the cursor to the first column that you want to select, hold down the mouse button, and draw a line to the other column that you want to select.
5 In the Profiler Monitor, click Refresh to verify that the Data Profiler job completed.
Note: You have to manually refresh the Profiler monitor to display whether a job has completed. If clicking Refresh does not change the status, wait one minute and then try clicking Refresh again.

After you click Refresh, the status displays as Done when the profiling task completes successfully.

Page 466: Data Integrator DM370R2 Learner Guide GA

To view relationship profile results
1 In the Designer Local object library, browse to the datastore and table for which you submitted the profile task.
2 Right-click the table, and select View Data.
3 In the View data window, click the Relationship profiler tab.
The generated relationship profile results are displayed to show the relationship between the two tables selected.
4 Click the underlined percentage on the diagram to view the key values that are not contained within the other table in the relationship analysis.
In this example, you can see that 55.56% of the employees from the Northwind EMPLOYEES table do not have an EMP_ID match in the ODS EMPLOYEES table. Clicking the percentage on the diagram displays the actual key values for this table on the right-hand side.

Page 467: Data Integrator DM370R2 Learner Guide GA

Using the Administrator, you can view the status of all profiling tasks, and cancel or delete profiling tasks along with their generated profile statistics.

To view Data Profiler tasks in the Administrator
1 Expand the Profiler Repositories node.
2 Click your Profiler repository name.
The Profiler Task Status window displays the task name, the names of the tables for which the profiler task was run, the run # (identification number) for the profiler task instance, the date and time the task was last performed, and the task status message.
Note: The status message is blank when a profiling task runs successfully. An error message is displayed if the task failed.

To cancel a Data Profiler task
1 Click the Select check box beside the profiling task you want to cancel.
2 Click Cancel.

To delete a Data Profiler task
1 Click the Select check box beside the profiling task you want to delete.
2 Click Delete.

Monitoring Data Profiler tasks in the Administrator

Page 468: Data Integrator DM370R2 Learner Guide GA

To view statistics on a specific Data Profiler task
1 In the Profiler Task Status page, click the task name.
This displays the Profiler Task Items report.

The Profiler Task Items report shows the:
• Status for each column on which the profiling task executed. Available status states are: Done, Pending, Running, and Error
• Item representing the column number in the data source on which this profiling task executed
• Job Server where the profiling task executed
• Process ID representing the Data Integrator process ID that executed the profiling task
• Profiling Type, which can be:
• Single Table Basic: the column profile with the default profiling statistics
• Single Table Full: the column profile with specified detailed profiling statistics
• Structural Basic: the relational profile with default profiling statistics
• Datastore name
• Source table profiled
• Column name on which the profiling task executed
• Last update: the date and time that this profiler task last performed an action
• Status message that displays an error if the profiling task failed. If this is blank, the profiling task completed successfully.

Page 469: Data Integrator DM370R2 Learner Guide GA


Activity: Submitting a column profiler task

Objective
In this activity you will submit a column profiler task to ensure that you have trusted information by:
• Identifying variations of the same data content
• Obtaining information on basic statistics, frequencies, ranges, and outliers
• Identifying nulls in the source system
• Discovering patterns and formats

Instructions
1 In the Designer Local object library Datastore tab, expand Northwind, and expand Tables.
2 Right-click Customers, and select Submit Column Profile Request.
3 From the top level, select Detailed Profiling.

Note: You want detailed profiling because you want to look at pattern column attributes in your Data Profiler task.

4 In the Profiler Monitor, click Refresh to verify that the Data Profiler job has completed.

5 In the Designer Local object library Northwind Datastore, expand Tables and right-click Customers.

6 Select View data, and click the Profiler tab to display the Profiler results.
You may need to expand the displayed window and scroll to the right to see all column attributes.

Using the information displayed, you can:
• Identify variations of the same data content
You can use the Min string length and the Max string length values to determine the length of a specific column in a target table.
• Obtain information on basic statistics, frequencies, ranges, and outliers
There are 21 distinct values in the Country column, but you know that the company has customers in at least 25 countries. Based on this information, you know the values in this column are incorrect.

Practice

Page 470: Data Integrator DM370R2 Learner Guide GA

You may want to go back to look at the source data to see if some customers have the wrong Country value.

• Identify nulls in the source system
Note that there are null values for the region, postal code, and fax columns. Using this information you can decide what you want to do with null values before moving them to a target table. You may want to use the Validation transform in your job design to substitute a different value for the null value.
• Discover patterns and formats
You can see that there are 20 different types of phone and fax patterns. Click the value for the phone or fax column and you will note that there are different uses of spaces, parentheses, and dashes. You may want to use the match_pattern function in the Validation transform to change different pattern types into specific required patterns dictated by your business rules.

Page 471: Data Integrator DM370R2 Learner Guide GA


Activity: Submitting a relationship profiler task

Objective
In this activity you will submit a relationship profiler task to ensure that you have trusted information by:
• Identifying and validating redundant data and relationships across the ORDERS table from two different data sources

Instructions
1 In the Formats tab of the Designer Local object library, right-click Orders_FlatFile, and select Replicate.
Note: It is easier to replicate Orders_FlatFile than to create a new file format because this file format is already set correctly for the data files used in this activity.

The File Format Editor opens:

2 Under the General section, in the Name field, type ProfileOrders_FlatFile.

3 Under the Data File(s) section, in the Root directory field, double-click the folder icon to change the location of the file to point to where you saved the file orders011197A.txt.
Note: You should have this file saved to your C:/Temp directory from the previous workshop, Reading multiple file formats.
4 In the File name(s) field, double-click the folder icon to point it to orders011197A.txt.

Practice

Page 472: Data Integrator DM370R2 Learner Guide GA


5 In the column attributes area, change the ORDERID column data type to varchar with a field size of 10.

6 Click Save & Close.
7 In the Datastore tab of the Designer Local object library, expand ODS_DS to display the ODS_SALESORDER table.
8 Right-click ODS_SALESORDER, and select Submit Relationship Profile Request > Relationship With.
Note: Notice that the cursor icon changes to prepare you to select the second source you want to create the relationship profile with.
9 Go to the Formats tab, and click ProfileOrders_FlatFile.
10 In the Submit Relationship Profile Request window, rename the task to: Order Profile ODS_DS/FlatFile.
11 From the ProfileOrders_FlatFile table, click and hold down the mouse button on ORDERID.
Note: The cursor turns into a hand icon.
12 Drag the hand cursor from ORDERID to SALES_ORDER_NUMBER in ODS_SALESORDER to connect these two columns.
13 Click Save all columns data.
This enables you to view all the columns in the tables profiled when you view the profiling results.
14 Click Submit.
15 In the Profiler Monitor, click Refresh to verify the profiling task has completed.
16 Right-click ProfileOrders_FlatFile, and select View data.

Page 473: Data Integrator DM370R2 Learner Guide GA

17 Click the Relationship profiler tab to access the relationship results:

You can see from the results that 81.63% of orders in ProfileOrders_FlatFiles do not match those in ODS_SALESORDER.

18 Click 81.63%.
The actual order ID, number of records, and percentage of the total this order ID makes up display beside the visual relationship representation.

19 Click order number 10421 to drill down further.
Note: The order number you drilled down on displays all instances of the same order number, along with the remaining column values found in ProfileOrders_FlatFile.

20 Click the View data tab.

Page 474: Data Integrator DM370R2 Learner Guide GA

21 Looking at the records, you can see that the ORDERID values appear in both integer and varchar data types:

22 Close the View Data window for ProfileOrders_FlatFile, and right-click ODS_SALESORDER to View Data.

23 Notice that the order numbers in ODS_SALESORDER are of the varchar data type. Based on the relationship profiler results and the business rules provided, you can now assess what to do with the ORDERID records that do not match.

Page 475: Data Integrator DM370R2 Learner Guide GA


Lesson Summary

Quiz: Profiling Data
1 Which profiler task enables you to compare non-matching values between two data sources from two different systems?

2 True or False. You can associate the Data Profiler repository to the same Job Server as the Designer.

3 List some of the tasks that you can accomplish when you submit a column profiler task using the Data Profiler.

4 List some of the tasks that you can accomplish when you submit a relationship profiler task using the Data Profiler.

5 Which Data Integrator tool do you use to view the status of all profiling tasks, and to cancel or delete profiling tasks along with their generated profile statistics?

Review

Page 476: Data Integrator DM370R2 Learner Guide GA

After completing this lesson, you are now able to:
• Explain what data profiling is
• Set up a Data Profiler repository
• Add Data Profiler users
• Configure Data Profiler tasks
• Connect the Data Profiler Server to the Designer
• Submit a profiling task and view generated profile information
• Monitor profiling tasks in the Administrator

Summary

Page 477: Data Integrator DM370R2 Learner Guide GA

Lesson 15
Managing Metadata

You can use the metadata reporting tool to assist you in better planning and managing your ETL jobs by viewing how data within your existing Data Integrator datastores is defined and by monitoring job execution performance.

In this lesson you will learn about:
• Using metadata reporting
• Using the metadata reporting tool

Duration: 1 hour

Page 478: Data Integrator DM370R2 Learner Guide GA


Using metadata reporting

Metadata provides a definition or description about data.

After completing this unit, you will be able to:
• Explain what metadata is
• Import and export metadata using the Metadata Exchange feature
• Use reporting tables and views to analyze metadata information in Data Integrator applications

Metadata is commonly distributed between a variety of information systems. This ability to exchange metadata allows information systems to communicate about changes in data.
Metadata can describe different types of data, including:
• Technical metadata: can be connection information and detailed schemas
• Business metadata: can represent business names
• Warehouse data elements: can represent sources, transformations, and targets
• Warehouse processing elements: can be used for scheduling, status reporting, and history recording

Support for metadata exchange
You can exchange metadata between Data Integrator and third-party tools using XML files or the Metadata Exchange feature. Data Integrator supports two built-in metadata exchange formats:
• AllFusion ERwin Data Modeler (ERwin) from Computer Associates, to export metadata for use with reporting tools like those available with the BusinessObjects 2000 BI Platform.
• CWM (Common Warehouse Metamodel)
This model is a complete specification of the syntax and semantics needed to export and import shared warehouse metadata.
Data Integrator can also use:
• The Meta Integration Model Bridge (MIMB)
MIMB is a Windows stand-alone utility that converts metadata models among design tool formats. By using MIMB with Data Integrator, you can exchange metadata with all formats that MIMB supports. This format is only available if MIMB is installed.

Introduction

Explaining what metadata is

Page 479: Data Integrator DM370R2 Learner Guide GA

• BusinessObjects Universal Metadata Bridge
This tool converts Data Integrator repository metadata to Business Objects universe metadata. A universe is a layer of metadata used to translate physical metadata into logical metadata. For example, the physical column name dept no can be labeled Department Number according to a given universe design.
Note: You must have the BusinessObjects Universal Metadata Bridge installed on the same machine as BusinessObjects Designer and Data Integrator Designer. Once installed, you can access this option under the Designer Tools menu.

You can import metadata from any of the supported formats into a Data Integrator datastore.

To import metadata files using Metadata Exchange
1 From the Designer toolbar, click Tools, and then click Metadata Exchange.
2 In the Metadata Exchange window, click Import metadata from file.
3 In the Metadata format list, select the metadata format you want.
4 In the Source file name field, type or browse to the source file name.
5 In the Target datastore name list, select the target datastore name.
6 Click OK to complete the import.

You can also export Data Integrator metadata into a file that other tools can read.

Importing and exporting metadata using the Metadata Exchange feature

Page 480: Data Integrator DM370R2 Learner Guide GA

To export metadata from Data Integrator using Metadata Exchange
1 From the Designer toolbar, click Tools, and then click Metadata Exchange.
2 In the Metadata Exchange window, click Export Data Integrator metadata to file.
3 In the Metadata format list, select the metadata format you want.
Note: You have additional options if you are using an MIMB-supported format. If you are using an MIMB-supported format, make sure you select the Visual check box to open the MIMB application when completing the export process. If you do not select this option, the metadata is exported without opening the MIMB application.
4 In the Target file name field, type or browse to the target file name.
5 In the Source datastore name list, select the source datastore name.
6 Click OK to complete the export.

The Data Integrator repository is a database that stores your application components and the built-in Data Integrator design components and their properties. The open architecture of the repository allows metadata sharing with other enterprise tools.

Within your repository, Data Integrator populates a special set of reporting tables with metadata describing the objects in your repository. When you query these tables, you can perform analyzes on your Data Integrator applications.

Using these queries you can:
• Analyze dependencies between Data Integrator objects (a sketch of this kind of query follows this list).
For example, you can determine which jobs call a particular data flow, and in turn, which data flows are called by a particular job. With this kind of information, you can analyze the impact of changing or deleting an object.
• Determine which sources and columns are used to populate a given target column.
With this kind of information, you can answer questions such as "Which targets are affected if source column Region is dropped from source Customer?"
• Determine the expressions used to populate columns in a target data warehouse or data mart.
With this kind of information you can answer questions such as "What expression is used to populate column Unit_Cost in target Inventory?" Finding the information described in the example above usually requires you to work backwards through a complex and time-consuming series of data transformations. With Data Integrator metadata reporting, a complex transformation sequence is represented by a single mathematical expression.
• Exchange metadata with another application.
Some reporting tables contain information that can be used to exchange metadata with other applications, such as an OLAP or BI tool.
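A dependency query of the kind described in the first bullet might look like the sketch below. The table and column names (OBJECT_USAGE, PARENT_OBJ, CHILD_OBJ), the DSN, and the data flow name are hypothetical placeholders; substitute the actual reporting table names documented for your repository:

import pyodbc

# Hypothetical reporting table and columns; replace with the real names from
# your Data Integrator repository's metadata reporting tables.
conn = pyodbc.connect("DSN=DI_Repository;UID=di_user;PWD=di_pass")
cursor = conn.cursor()
cursor.execute(
    "SELECT PARENT_OBJ FROM OBJECT_USAGE WHERE CHILD_OBJ = ?",
    ("DF_LoadCustomers",),   # hypothetical data flow name
)
for (parent,) in cursor.fetchall():
    print(parent)            # objects (for example, jobs) that call the data flow
conn.close()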

Using reporting tables and views to analyze metadata

Page 481: Data Integrator DM370R2 Learner Guide GA


Using the metadata reporting tool

This web-based Metadata Reports application provides several convenient options for analyzing dependencies, managing job execution performance, and producing documentation for your Data Integrator projects.
After completing this unit you will be able to:
• Access the Data Integrator Metadata Reports
• View Impact and Lineage Analysis metadata reports
• View Operational Dashboard metadata reports
• Use Auto Documentation metadata reports

Using Data Integrator Metadata Reports you can view:
• Impact and Lineage Analysis Reports: use these reports to easily browse, analyze, and produce reports on the metadata for your Data Integrator jobs, including other Business Objects applications associated with Data Integrator.

• Operational Dashboard Reports: use these reports to view graphical dashboards that allow you to evaluate your job execution performance.

• Auto Documentation Reports: use these reports to conveniently generate printed documentation to capture an overview of an entire ETL project and critical information for the objects you create in Data Integrator.

To use the Data Integrator metadata reporting tool, you must add the appropriate repository to the Administrator.
Note: For this course you do not need to do this because you already added the Data Integrator repository to the Administrator in Lesson 13.

Introduction

Accessing Data Integrator Metadata Reports


To access the Data Integrator Metadata Reports
1 From the Start menu, point to Programs > BusinessObjects Data Integrator 11.5 > Metadata Reports.
In the Metadata Reports login page, enter the user name and password login information.
Note: The default user name is admin and the default password is admin. Only users with valid Administrator login privileges can log in to the Metadata Reports application.

The Metadata Reports homepage appears:

2 Click the appropriate icon to select one of the available categories.


Impact and Lineage Analysis metadata reports provide a simple, graphical, and intuitive way to view and navigate through various dependencies between objects in the Data Integrator repository. This analysis allows you to identify which objects will be affected if you change or remove other connected objects.

If you are running BusinessObjects Enterprise and related applications, you can also view Impact and Lineage Analysis reports for these application objects, such as universes. To do so, you must install, configure, and run the Metadata Integrator. For more information see “Adding Metadata Integrator”, Chapter 2 in the Data Integrator Metadata Reports User’s Guide.

To access Impact and Lineage Analysis metadata reports
• From the Metadata Reports home page, click Impact & Lineage Analysis.

The Impact and Lineage Analysis page contains two primary panes:
• The left pane contains a hierarchy (tree) of objects. The top of the tree is the default repository.
• The right pane displays object content and context based on the different objects that you select from the left pane.

The top level of the navigation tree displays the current repository. When you expand this tree hierarchy, you can see that the Data Integrator repository objects that you can analyze include the repository itself, datastores, tables, and columns.
Note: You can change the repository you want to analyze by clicking the Settings option in the upper-right corner of the page.

In the left pane you can either search or select an object to analyze. For Data Integrator objects, use the Table and Column option from the drop-down list because you want to analyze Data Integrator repository datastores, tables and columns. This option is shown here:

Viewing Impact and Lineage Analysis metadata reports


By default, the objects listed for analysis include objects from other Business Objects applications, such as Universes and Business Views. The ability to analyze these different objects depends on the applications you have installed.

As you click each object in the tree hierarchy, the right pane displays overview, lineage, or impact information associated with the object. The information displayed depends on the object that you analyze.

The table below describes the associated content of the Overview tab for each Data Integrator object:

• Repository: This tab displays the repository name, type, and version.
• Datastore: The Overview information varies depending on the datastore type. For a datastore on Microsoft SQL Server, this tab displays the application type (e.g. database), database type, database user name, database case sensitivity, datastore configuration, database version, database name, and server name.
• Table: This tab displays the table name, the datastore the table belongs to, the table owner name in the database, the table business-use name (if defined), the table type (e.g. table or template table), and the date when Data Integrator last updated the table.
• Column: This tab displays the column name, table name, data type for the column, and whether the column contains nulls, primary key, and foreign key values.

The next section explains the Impact and Lineage tabs for the table and column objects in more detail.


Impact analysis tab
Impact analysis allows you to look at the end-to-end impact of the selected source table or columns and the targets they affect. For example, impact analysis can help you answer questions such as: If I drop column A from table A, or drop source table B, which targets will be affected?

In this example, you can see which target tables would be affected if you made changes to the SUPPLIERS source table:

Continuing with this example, you can see that Impact analysis at the COMPANYNAME column level for the SUPPLIERS source table shows which target tables are affected by the COMPANYNAME source column:

Note: Moving your cursor over an object gives you additional information related to the object, such as datastore, data flow, and database owner name.


Lineage analysis tab
Lineage analysis allows you to view where the information in a target table comes from, or see what makes up a specific column mapping for a target table. For example, lineage analysis can help you answer questions such as: Where does the data that populates table A or column B in this target come from?

In this example, you can see that data for the SALES_FACT target table comes from the ODS_SALESITEM and ODS_SALESORDER tables:

In this example, you can see that the column mappings from the ORD_STATUS column in the SALES_FACT target table actually derive from two different columns:


Calculating column mappings
To ensure that the Impact and Lineage Analysis metadata reports display column mappings when you perform either an impact or a lineage analysis, you must do one of the following:
• Manually calculate column mappings in the Metadata Reports tool
• Enable the Designer to automatically generate column mapping information when saving data flows

You can use the Calculate Column Mapping option in the Impact and Lineage Analysis metadata reports to generate the latest column mappings in your repository.

To calculate column mappings in the Metadata Reports application
1 In the top-right of the Impact & Lineage Analysis page, click Settings.
2 In the Impact & Lineage Analysis window, click the Refresh Usage Data tab.
3 Select the related Job Server for the repository you are analyzing, and click Calculate Column Mapping.

You can also enable the Designer to automatically generate this information when you save data flows in the Designer.

If you do not enable the Designer to automatically calculate column mappings, then you must manually select the Calculate Column Mappings option from the Designer before generating the Impact and Lineage Analysis reports. Doing so ensures that Data Integrator returns accurate mapping information.

For example, assume you did not enable the Designer to automatically calculate column mappings when saving data flows; however, you can see the current column mappings in the Metadata Reports tool because you selected the Calculate Column Mappings option.

Then, working in the Designer, you unintentionally move an object in a data flow. When you try to execute the job, Data Integrator prompts you to save the changes to the data flow, and you save the changes.

Later, you use the Metadata Reports tool to view the column mapping information for the data flow, but the information is gone. This is because the modified data flow that you saved is now a new object in the repository and does not have column mapping information until you manually calculate it.

To enable the Designer to automatically generate column mapping
1 From the Designer toolbar, click Tools, and then click Options.
2 In the Options window, expand Designer, and select General.

Note: If you do not see a (+) node beside Designer, click on any of the other options, and then click Designer again.

3 From the General options listed, select Calculate column mapping while saving data flow.


Activity: Using Impact and Lineage Analysis

Objective
In this activity you will use the Impact and Lineage Analysis metadata reports category to evaluate how new data and business requirements may affect any of the existing ETL jobs you have created.

Scenario

You are given new customer information and business requirements that require you to make modifications to the CUSTOMER source table that resides in the ODS database. Before you make any changes, you want to make sure that you know what targets will be affected by these changes.

Instructions
1 From the Start menu, point to Programs > BusinessObjects Data Integrator 11.5 > Metadata Reports.
2 In the Metadata Reports home page, click Impact & Lineage Analysis.
3 In the left pane, from the Select objects to analyze list, select Table and column.
4 Under the Datastores node, expand the ODS_DS datastore.
Note: This is the datastore that connects to the ODS database.
5 Click the ODS_CUSTOMER table.
The Overview, Impact, and Lineage tabs appear in the right pane.
6 Click the Impact tab.

You can see that any changes you make to the ODS_CUSTOMER source table will affect the CUST_DIM, R1, R123, R2, and R3 target tables:

Now you want to know the associated data flows that are used to load data into the targets displayed.

Practice


7 Move the cursor over the CUST_DIM target.
A tool tip appears providing information on the data flow, datastore, and table owner name. You can see that target CUST_DIM is dependent on CustDim_DF:
8 Move the cursor over the R1 target.
You can see from the information displayed in the tool tip that target R1 is dependent on Case_DF:
9 Move the cursor over the remaining R123, R2, and R3 targets.
You should see that these targets are all dependent on Case_DF.
After doing an impact analysis on the ODS_CUSTOMER source table, you are now:
• Aware that modifications to this source table mean that you also have to look at both the targets and data flows displayed in the analysis
• Able to determine if you have to make changes to the associated targets and data flows


Operational dashboard reports provide graphical representations of Data Integrator job execution statistics and duration over a period of time. Using this information you can streamline and monitor your job scheduling and management for maximizing overall job efficiency and performance.

To access operational dashboard metadata reports
• From the Metadata Reports home page, click Dashboards.

The dashboard summary page displays the different dashboards available.

Each dashboard category contains a current (snapshot) report and a historical (trend) report.

The two dashboards in the top row provide a current (snapshot) for the last 24 hours, and the two dashboards on the bottom row display trends over the last 7 days.

There are two categories of dashboards:
• Displayed on the left-hand side are Job execution statistics dashboards. These show you how many jobs succeeded or had errors.
• Displayed on the right-hand side are Job execution duration dashboards. These show you how long it took the jobs to run and whether the run times are within acceptable limits.

For Job execution statistics, the dashboard reports are represented as:
• Current (snapshot) pie chart
• Historical (trend) bar chart

Viewing operational dashboard metadata reports


For Job execution duration, the dashboard reports are represented as:
• Current (snapshot) speedometer
• Historical (trend) line chart

Clicking on each dashboard allows you to drill down for more detailed information.

Job execution statistics
Job execution statistics dashboards are on the left side of the dashboards summary page. The color codes on these apply to the status of the job’s execution:
• Succeeded (green)
• Warnings (orange): applies to one or more warnings
• Errors (red): applies to one or more errors
• Running (blue)

Current (snapshot) pie chart - last 24 hours
The pie chart displays status information for jobs run in the time period displayed. The chart identifies the number and ratio of jobs that succeeded, had one or more warnings, had one or more errors, or are still currently running:


To view a current snapshot for job execution statistics
1 From the Dashboard Summary page, click on the pie portion or pie slice you want more information on.
A tabular list of jobs displays for each job execution statistics status.
2 Click on the desired status to view more detailed information. Each Job Execution Statistics table includes:
• Repository name: the repository associated with this job
• Job name: the name of the job in the Designer
• Start time and End time: the start and end timestamps in the format hh:mm:ss
• Execution time: the time it took the job to run
• System configuration: the name of the system configuration that applies to that job
3 Click the Operational tab or the link in the navigation path at the top of the window to return to the dashboard summary page.

Historical (trend) bar chart - last 7 days
The Job Execution Statistic History bar chart displays how many jobs succeeded, had warnings, failed, or are still currently running on each of the last 7 days.

As with the Current (snapshot) pie chart, you can click on the individual bars to drill into the report to display the Job Execution Statistics tables.


Job execution duration
Job execution duration dashboards are on the right side of the dashboards summary page. The color codes on these two charts illustrate the time allotted for job execution, the tolerance for jobs that go beyond the allotted time, and the time when the job execution exceeds the tolerance and becomes critical:
• Normal (green): All jobs executed within the job execution time window. This window is the amount of time that you allotted to run your jobs. The number displayed at the high end of the green range is the Job execution time window setting, in hours. You can set job execution times during non-peak hours to ensure that your target data warehouse is available to applications and business users during peak hours. The size of the green zone changes based on the Job execution time window setting.

• Warning (orange): At least one job finished executing during the tolerance period specified. Tolerance is the amount of time that jobs can run beyond your normal Job execution time window. The size of the orange zone changes based on the Job execution time window tolerance setting.

• Critical (red): At least one job finished executing beyond the normal or warning (tolerance) limits. The size of the red zone is fixed.

Note: Changing dashboard settings is covered at the end of this section.

Current (snapshot) speedometer - last 24 hours
The needle on the dashboard indicates the actual time elapsed from the start of the first job to the end of the last job. For example, 8/23/05 2:00 AM - 8/23/05 1:38 PM is the time elapsed from the start of the first job to the end of the last job.

You can mouse over this range to show tool tips that display the Job Execution Duration (total elapsed time). This same value also displays below the needle hub in decimal form. For example, jobs J1, J2, and J3 are scheduled to run in parallel. J1 starts first and J3 finishes last. The elapsed time displayed reflects the end time of J3 minus the start time of J1.


Looking at the different zones for the speedometer in this Job Execution Duration example, you can see that:

• If the needle points within the green colored range, all the jobs have been executed in the allotted time frame. For example, if your start time is 2:00 AM and the number displays 7, your preferred time window to execute all jobs is from 2:00 AM to 9:00 AM.

• If the needle points within the orange colored range, some jobs have been executed within the tolerance time allotment. For example, if the high end of the green zone is 7 and the orange zone denotes 8, your tolerance for all jobs to execute is one additional hour.

• If the needle points in the red colored range, some jobs have been executed beyond the normal or tolerance ranges. For example, the high end of this zone is 24 hours.

The needle hub also changes color to reflect the range in which the needle points.
Note: This dashboard does not indicate whether a job succeeded or had errors; it only displays execution duration.

You can drill down into the speedometer to view a Gantt chart that displays all the jobs that executed successfully and when each finished executing.


To view a current snapshot for job execution duration
1 From the Dashboard Summary page, click the speedometer to drill down.

The Gantt chart displays the different start time, end time, and color status for the executed jobs:

The first (orange) line at 9:00 denotes the high end of the Job execution time window setting. The second (red) line at 10:00 indicates the high end of the Job execution time window tolerance setting.
Note: You can mouse over the Gantt bars to view a tool tip window with the total elapsed time for Job Execution Duration.
2 Click one of the job names on the chart.

The Dataflow Execution Duration graph displays:

Note: The Table tab displays different information depending on the level at which it is clicked. For example, if you click the Table tab at the job level, you can see table information such as repository name, job name, start time, end time, execution time, status, and system configuration.


If available, you can hover over the data flow name to drill down further to display more information, such as audit statistics:

Historical (trend) line chart - last 7 days
The Job Execution Duration History line chart shows the trend of job execution performance over the last 7 days. The vertical axis is in hours, the horizontal axis shows the date, and the color panels illustrate the limits set for the job execution period:

Each point denotes the total elapsed time for the job execution duration of all the jobs for that day.

Clicking the marker points on the line chart allows you to drill down to display the Job Execution Duration values for each day on a Gantt chart as described in the previous section Current (snapshot) speedometer.

Similar to the speedometer dashboard, by clicking on the Gantt bar you can drill into the job for data flow execution duration times and more information, such as audit statistics.
Note: The values displayed in the tool tips should be the same when you hover the cursor over the Today marker on this line chart and when you hover the cursor over the speedometer.


Changing dashboard display settings
You can also change the dashboard settings to specify which repository you want to view these dashboards for, and to change the displayed job execution start time, time window, and tolerance.

The Monitor, Administrator, and Multi-User Administrator user roles can all change display settings.

To change dashboard display settings
1 In the top-right corner of the dashboard summary page, click Settings.

The Dashboard Settings window appears:

2 Change the settings by selecting or typing the values in the fields listed, and then click Apply.

The dashboard settings are:
• Select repository: select all or a specific repository from the list.
• Job execution start time: for Job Execution Duration and Job Execution Duration History reports, enter the start of the job execution window in the format HH:MM (from 00:00 to 23:59).

• Job execution time window: for Job Execution Duration and Job Execution Duration History reports, enter the desired job execution time frame in number of hours (numeric value from 0 to 24).

• Job execution time window tolerance: for Job Execution Duration and Job Execution Duration History reports, enter the number of hours to add to the Job execution time window.

This field accepts numeric values from 0 to 24. The total of the Job execution time window parameter and the Job execution time window tolerance cannot exceed 24.
Note: Setting the job execution time window tolerance is optional. When set, the Job execution time window tolerance displays as a warning (orange) area (on the speedometer and line charts) or as a red line limit (on Gantt charts).


Auto documentation is a convenient and comprehensive way to create documentation for all of the objects you create in a Data Integrator project.

After creating a project, you can use Auto documentation reports to quickly create a PDF or Microsoft Word file that captures a selection of job, work flow, and/or data flow information including graphical representations and key mapping details.

Auto documentation reports provide information on:
• Object properties: applies to work flows, data flows, and R/3 data flows
• Variables and parameters: applies to jobs, work flows, data flows, and R/3 data flows
• Table usage: shows all tables used in an object and its child objects. For example, for a data flow, table usage information includes datastores and the source and target tables for each.

• Thumbnails: this is an image that reflects the selected object with respect to all other objects in the parent container and applies to all objects except for jobs.

• Mapping tree: applies to data flows

Using Auto Documentation metadata reports


To access Auto documentation metadata reports
1 From the Metadata Reports home page, click Auto Documentation.

The Auto Documentation page displays two primary panes:

• The left pane shows a hierarchy (tree) of objects. • The right pane shows object content and context.

2 Expand the tree to select an object to analyze. When you select an object, the object’s pertinent information displays in the right pane.

3 In the right pane, select each tab to display desired information. Note: Tabs vary depending on the object you are exploring.

Viewing object information
You can navigate between objects by selecting the objects in the navigation tree. The hierarchy of the tree matches the hierarchy of objects created in Data Integrator Designer:
• Repository
• Project
• Job
• Work flow
• Data flow

Repository information
Clicking on a repository name in the navigation tree displays the Overview tab in the right pane. This information includes the repository name, type, and version.


Project information
Clicking on a project name in the navigation tree displays the Project tab in the right pane and displays the list of jobs in the project.

Job information
You can click on a job in a project to drill into the hierarchy to display the job name and Table usage tabs.

The job name tab displays a graphical representation of the objects that make up the job, including any work flows, data flows, variables, scripts and parameters contained within it:

Note: From the bottom right, click to Show or Hide the Variables and Parameters header panel.

The table usage tab displays the datastores and associated tables contained in the selected object.

Work flow information
Clicking on a work flow name displays the work flow name and Table Usage tabs.

The work flow name tab includes a thumbnail that represents the selected object with respect to all other objects in the parent container. For example, if this work flow is the only one in the job, it appears alone; but if there are two other work flows in the job, these will also appear in the thumbnail image.

Clicking on the work flow thumbnail displays a graphical image of the next level of objects, such as one or many data flows, contained within the work flow.


You can also click to Hide/Show object properties, such as Execute only once or Recover as a unit, and Variables and Parameters, if applicable.

Note: When you click on the data flow graphical interface, the data flow details are displayed and an additional Mapping tab appears in between the work flow name and Table Usage tabs. You can drill down further into the objects that make up the data flow displayed at this level. The next section discusses graphical displays at the data flow level.

The table usage tab lists the datastores and associated tables contained in the selected object, if applicable.

Data flow information
Clicking on a data flow name displays the data flow name, Mapping tree, and Table usage tabs:

The Mapping tree tab displays the transform mappings within the data flow and the Table usage tab displays the source, transforms, and targets used.

You can also click on each data flow object to view:


• A table that displays a thumbnail of the source table with respect to the other objects in the data flow and table properties:

These table properties include Optimizer hints, such as caching and join rank settings, and Table basics, such as datastore and table name.

• Transform mappings in a transform:

• Target table basics such as datastore and owner name, and Loader properties such as autoload options:


Generating Auto Documentation for an object
You can generate and print Auto Documentation metadata reports at the job, work flow, or data flow level to an Adobe PDF or Microsoft Word format. Printing at each level displays different objects.

Printing at the job level displays:
• A graphical representation of the job
• Variables and parameters, if applicable
• A graphical representation of the data flow within the job
• Transform mappings
• Table usage information
Printing at the work flow level displays:
• A graphical representation of the work flow components
• A graphical representation of the data flow within the work flow
• Transform mappings
• Table usage information
Printing at the data flow level displays:
• A graphical representation of the data flow within the work flow
• Transform mappings

To generate Auto Documentation metadata reports
1 In the right pane of the Auto Documentation window, beside the object name header, click the print icon.
The Auto Documentation print window displays:
2 Clear any options you do not want to appear in the print output.
3 Select a print output format.
4 Click Print.
5 In the Windows File Download window, click Save to save the document to the appropriate location.
6 After saving or printing the report, click Close to close the Print window.


Changing auto documentation display settings
You can also change the auto documentation settings to specify from which repository you want to view auto documentation.

To change auto documentation display settings
1 In the top-right corner of the Auto Documentation home page, click Settings.
The Auto Documentation settings window appears:

2 Select the repository from which you want to view auto documentation.
3 Click Apply.


Activity: Creating Auto Documentation

Objective
In this activity you will:
• Use Auto Documentation to view the components that make up CDC_Job
• Create an Auto Documentation output in Adobe PDF format to send to another developer who will use it for ETL design purposes

Instructions

Using Auto documentation
1 From the Designer toolbar, click Tools, and then click Metadata Reports.
2 Click Auto Documentation.
3 From the navigation tree, browse to CDC_Job.
4 Expand CDC_Job.

The objects and variables that make up CDC_Job appear:

5 Double-click the script to view the script content.
6 From the navigation tree, click CDC_Job to get back to the job view level.
7 Click Table usage to view the source, transforms, and target used in this job.
8 From the navigation tree, click CDC_DF.

Notice the Mapping tab now appears on top with the data flow name and Table usage tabs.

9 Click CDC_DF to display the objects that define the data flow.

Practice


10 Click the Mapping tab to display the column mappings used in CDC_DF:

Generating Auto Documentation metadata reports
1 From the navigation tree, click CDC_Job.
2 In the right pane of the Auto Documentation window, beside CDC_Job, click the print icon.
3 In the Auto documentation window, leave the default PDF option, and click Print.
4 In the File Download window, click Save.
5 Name this document CDC_Job_Autodoc, and select a destination to save it to.


Lesson Summary

Quiz: Managing Metadata
1 Give some examples of types of data that metadata can describe.

2 What kind of information can you find in the operational dashboard category of the Metadata Reports tool?

3 What are some supported formats for metadata exchange?

4 What kind of analysis are you doing when you use the Impact & Lineage analysis metadata reports category to view source column or table information?

5 What kind of analysis are you doing when you use the Impact & Lineage analysis metadata reports category to view target table information?

6 What is the difference between profiling data and managing metadata?

Review


After completing this lesson, you are now able to:
• Explain what metadata is
• Import and export metadata using the Metadata Exchange feature
• Use reporting tables and views to analyze metadata information in Data Integrator applications
• Access the Data Integrator Metadata Reports
• View Impact and Lineage Analysis metadata reports
• View Operational Dashboard metadata reports
• Use Auto Documentation metadata reports

Summary


Workshop

Building MySales warehouse

Duration: 2.5 to 3 hrs

Objective
In this activity you put together all the concepts you have learned to:
• Create a few dimension tables and a fact table to load into a mini-warehouse for reporting
• Use the Data Integrator metadata reports tool to analyze the data in your mini-warehouse

Scenario
You identify some data sets that you need to work with to load a mini-warehouse to support a specific sales process within your organization.

After analyzing the data sets, you know you have to:
• Populate DIM_TIME in your data warehouse with time values specified by the business requirements.
• Populate Dim_Customer_DF with the necessary columns specified by the business requirements.
• Populate DIM_PRODUCT with the necessary fields specified by the business requirements. You are also going to calculate the value of each product in inventory.
• Create a fact table that measures total order revenue. To create this table, you will have to use multiple transforms to map and validate data.
• Use the metadata reporting tool to analyze the data in MySalesDW.
You also know that you:
• Need to create several data flows and use these to create a sequential data movement job.
• Are working in a test environment, so you want to be able to define each data flow and separately test each data flow within a job.
• Have to test the data flows to make sure they work correctly before you combine the data flows to execute as one job.

Practice


Instructions
This activity is broken down into five parts:
1 Populating the Time Dimension
2 Populating the Customer Dimension
3 Populating the Product Dimension
4 Creating a SalesOrder Fact Table
5 Adding an aggregate table for sales by region by country

Data setup
1 Create two databases in MS SQL Enterprise Manager:
• MySalesDW_Source
• MySalesDW
Note: Use the default user name and password sa.
2 From the resource CD, copy or save the MySalesDW_Source.bak and MySalesDW.bak files to your computer, and restore the databases.

To restore the database
1 From the navigation tree in SQL Server Enterprise Manager, right-click MySalesDW_Source > All Tasks > Restore Database.
In the Restore database window, beside the Restore as database field, ensure MySalesDW_Source is selected.
2 In the Restore options area, click From device.
3 Under the Parameters section, click Select Devices.
4 In the Choose Restore Devices window, click Add.
5 In the Choose Restore Destination window, click the ellipsis button to browse to the location of MySalesDW_Source on your computer, and click OK.
6 Click OK twice to exit the Choose Restore Devices and Restore database windows.
You should see a message displaying the progress as the database is restored.
Note: If you are unable to restore the files directly because you get a message notifying you that the database already exists, on the Options tab, select the Force restore over existing data option.

7 Click OK when you get a message that the database is successfully restored.

8 Repeat step 1 to step 7 to restore MySalesDW.


Project setup
1 Log into the Designer using direpo as your database and direpouser as your user name.
2 Create a datastore for each of MySalesDW_Source and MySalesDW.
3 Import these tables from the MySalesDW_Source datastore:
• CUSTOMERS
• ORDERS
• ORDERS_DETAILS
• PRODUCTS
4 Import these tables from the MySalesDW datastore:
• AGG_REGION_SALES
• BAD_ORDERS
• DIM_CUSTOMER
• DIM_EMPLOYEE
• DIM_PRODUCT
• DIM_TIME
• FACT_SALES_ORDER
5 Create a job:
• LoadMySalesDW_Job
6 Create four data flows, separate from LoadMySalesDW_Job for now:
• Dim_Time_DF
• Dim_Customer_DF
• Dim_Product_DF
• SalesOrderFact_DF
Remember that in a testing environment you want to be able to define each data flow and test each data flow separately within a job. Once you have tested your data flows to make sure they work correctly, you can then combine the sequential data flows to execute as one job.
7 Change the view data sampling size in the Designer to display the correct row count needed for these activities. Go to Tools > Options, expand the Designer node, click General, and in the View data sampling size (rows) field change 1000 to 3000.


Populating the Time Dimension
1 Open Dim_Time_DF.
2 Use the Date_Generation transform to populate DIM_TIME in MySalesDW with daily increment values for the time range 1996.01.01 to 1999.01.01.
3 Add the Query transform.
4 Connect your source, Query, and target.
5 In the Query Editor, map DI_generated_date to the Date column in the Schema Out pane.
6 Map the remaining columns with the built-in Data Integrator date functions using this table:
• Date: Actual date (you already mapped this in step 5)
• Year: Year of the actual date above
• Quarter: Quarter in which the actual date above occurs
• Month: Month of the actual date above
• Week: Week of the year in which the actual date above occurs
Tip: Highlight each target column and use the Functions button in the Query Editor Mapping tab to access the built-in date functions; example mappings are sketched below.
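As a cross-check for step 6, the column mappings typically come out as one-line expressions built from the Data Integrator date functions, along the lines of the sketch below. The input qualifier and column (Date_Generation.DI_GENERATED_DATE) and the function names are assumptions based on the built-in date function group, so confirm them against the Functions list in the Mapping tab:

Year:    year(Date_Generation.DI_GENERATED_DATE)
Quarter: quarter(Date_Generation.DI_GENERATED_DATE)
Month:   month(Date_Generation.DI_GENERATED_DATE)
Week:    week_in_year(Date_Generation.DI_GENERATED_DATE)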

7 Validate Dim_Time_DF.
8 Open LoadMySalesDW_Job and, from the Local object library, drag Dim_Time_DF into the workspace.
9 Execute LoadMySalesDW_Job with Dim_Time_DF.

Note: Your job should generate errors when executing. Click the error icon to see this error message:

The remaining errors in the log are caused by this duplicate key conflict. Depending on your application, you can correct this error using one of two target table options:
• Delete data from table before loading: Use this option if you want to fully update the table and not keep any existing rows in your target table.
• Auto-correct load: Use this option if you want to add to the existing rows in your target table. The auto-correct load option uses the primary key in your target table to update existing rows.

10 For the purpose of these activities, you want to update the entire target table and not keep any existing target rows. Modify your target table to reflect this.

11 Validate and execute LoadMySalesDW_Job with Dim_Time_DF again.
Your job should execute successfully without any errors.

12 After your job executes successfully, go back to the job level in the workspace, right-click Dim_Time_DF and delete it.



Note: This does not delete Dim_Time_DF from the local object library. Instead you are removing it from LoadMySalesDW_Job.

Questions
1 How many rows are in your target table?
2 Why did you get the Cannot insert duplicate key in object error?

Solution
1 The target record count should be 1097.
2 This occurs because you cannot insert a primary key into a table where the same primary key already exists.
• Your Dim_Time_DF should look like this:


Populating the Customer Dimension
1 Open Dim_Customer_DF.
2 Use the CUSTOMERS table as your source and the DIM_CUSTOMER table as your target.
3 Do a straight mapping for these columns:
• CUSTOMERID
• ADDRESS
• CITY
• POSTALCODE
• COUNTRY
4 Map the COMPANYNAME column, but transform it to display in uppercase only.
Tip: Use a built-in string function to accomplish this (see the sketch after these steps).
5 Validate the region column with this validation rule: pass all the rows with a valid region. If this rule fails, replace NULL regions with Region Unknown.
Note: Because you want to pass all rows, there is no need to send failed rows to any targets.
6 Modify your target table so that it deletes all data before loading.
7 Validate Dim_Customer_DF.
8 Open LoadMySalesDW_Job, drag Dim_Customer_DF into the workspace, and execute the job.
9 After your job executes successfully, go back to the job level in the workspace, and delete Dim_Customer_DF.
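As a reference for steps 4 and 5, the COMPANYNAME mapping reduces to a single expression using the built-in upper() string function, and the region rule is typically set up in the Validation transform roughly as outlined below. The exact option labels in the Validation transform editor may differ in your version, so treat this as an illustrative sketch rather than exact syntax:

COMPANYNAME mapping:            upper(CUSTOMERS.COMPANYNAME)
Validation condition on REGION: CUSTOMERS.REGION is not null
Action on Failure:              Send to Pass, substituting 'Region Unknown' for the REGION value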

Solution
• The target record count should be 91.
• Look in the DIM_CUSTOMER target and notice that:
• The company name is now displayed in uppercase.
• Valid regions are displayed and the three NULL regions have been replaced with Region Unknown.
• Your data flow should look similar to this:


Populating the Product Dimension
1 Open Dim_Product_DF.
2 Use PRODUCTS as your source table and DIM_PRODUCT as your target.
3 Do a straight mapping for these columns:
• PRODUCTID
• PRODUCTNAME
• UNITPRICE
• DISCONTINUED
• CATEGORYID to CATEGORY
4 Calculate the value of each product in the inventory.
Tip: Use the Smart Editor in the Mapping tab to map your calculation. For your calculation, use UNITPRICE and multiply it by UNITSINSTOCK (a sample expression follows these steps).
5 Modify your target table so that it deletes all data before loading.
6 Validate and execute the job.
7 Once your job executes successfully, go back to the job level in the workspace, right-click Dim_Product_DF, and delete it.
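For step 4, the inventory-value calculation is a single arithmetic expression in the Mapping tab, for example (the qualifier assumes the PRODUCTS source schema; adjust it to match your data flow):

PRODUCTS.UNITPRICE * PRODUCTS.UNITSINSTOCK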

Solution

The target record count should be 78.

Your data flow should look similar to this:


Creating the SalesOrder fact table
1 Use ORDERS and ORDERS_DETAILS as your source tables.
2 Use the Query transform to combine the columns from both sources into the FACT_SALES_ORDER target table.
3 Do a straight mapping for these columns:
• ORDERID
• CUSTOMERID
• ORDERDATE
• PRODUCTID
• UNITPRICE
• QUANTITY
4 Use the Smart Editor in the Query Mapping tab to calculate the total revenue for the TOTAL_ORDER_VALUE column.
Tip: The DISCOUNT column in the ORDER DETAILS table is a percentage. The mapping function is provided at the end of this exercise, but challenge yourself and try to do this on your own.
5 Propose a join in the Query transform WHERE clause (see the sketch after this step).
Notice the DW_KEY column in the Schema Out pane of the Query Editor. DIM_EMPLOYEE is an SCD Type 2 table, and DW_KEY acts as a surrogate key. Generally, employee information such as address and phone number is very susceptible to change. In an SCD Type 2 table, you preserve history by keeping the old row and adding another row for the new data. As a result, you have duplicate employee ID records. Because EMPLOYEEID is no longer unique, DW_KEY is added to the DIM_EMPLOYEE table to use as the primary key.
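A minimal sketch of the step 5 join, assuming the two sources share the ORDERID key (write the ORDERS_DETAILS / ORDER DETAILS qualifier exactly as the table name appears in your datastore):

ORDERS.ORDERID = ORDERS_DETAILS.ORDERID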

6 Retrieve the DW_KEY values from the DIM_EMPLOYEE table to load into your SalesOrder fact table. Compare these values to return the corresponding DW_KEY output:
• EMPLOYEEID from DIM_EMPLOYEE
• EMPLOYEEID from ORDERS
Tip: Use the Lookup_ext function to map and retrieve the DW_KEY from the DIM_EMPLOYEE table. The lookup_ext function is provided at the end of this exercise, but challenge yourself further and try it on your own.

7 Delete the connecting line between the Query and the target.


8 Validate the combined CUSTOMERID and TOTAL_ORDER_VALUE columns using the rules below before loading the data into the target table:
• Validation rule: Ensure that the CUSTOMERID exists in the DIM_CUSTOMER table in the data warehouse. Affected column: CUSTOMERID. Action on failure: Send it to Fail.
• Validation rule: Ensure that all order totals are greater than 0. Affected column: TOTAL_ORDER_VALUE. Action on failure: Send it to Fail.

9 Add the BAD_ORDERS table and make it a target. Do not connect the Validation transform to the BAD_ORDERS table yet, because the BAD_ORDERS table contains only a subset of the columns.

10 Do a straight mapping for the columns below before you send the failed rows to the BAD_ORDERS table:
• ORDERID
• DI_ERRORCOLUMNS
• DI_ERRORACTION
This data flow design should have:
• The ORDERS and ORDERS DETAILS source tables.
• A transform used for combining and mapping the source tables into the FACT_SALES_ORDER table.
• A transform used to validate the rules provided for the combined columns.
• A transform used for mapping the failed ORDERID rows to the BAD_ORDERS table.
A graphical representation is also available at the end of this activity, but challenge yourself first. See if you can figure this activity out!

11 Modify your target table so that you delete all data before loading it.
12 Validate and execute the job.
13 After your job executes successfully, go back to the work flow level in the workspace, right-click SalesOrder_Fact, and delete it.

Solution
• The target row count for FACT_SALES_ORDER should be 2155.
• The target row count for BAD_ORDERS should be one.
• TOTAL_ORDER_VALUE calculation:

(("ORDER DETAILS".QUANTITY * "ORDER DETAILS".UNITPRICE) - (("ORDER DETAILS".QUANTITY * "ORDER DETAILS".UNITPRICE) * "ORDER DETAILS".DISCOUNT))

• DW_KEY mapping function:
lookup_ext([MySalesDW.DBO.DIM_EMPLOYEE,'PRE_LOAD_CACHE','MAX'], [DW_KEY],[NULL],[EMPLOYEEID,'=',ORDERS.EMPLOYEEID])



• Your data flow should look like this:

Alternative method: Adding function mappings in the Query Editor Schema Out pane

You can also map the Lookup_ext function by adding a function column for DW_KEY in the Schema Out pane of the Query Editor.

Mapping the function like this allows you to right-click and modify/view the function call without having to define the parameters for the function again.

To add a function column for DW_KEY in the Schema Out pane
1 Right-click DW_KEY, click New Function Call, and select Insert Below.
2 Select the Lookup_ext function.
3 Complete the Lookup_ext as you did before, but notice that in the Output Parameters, there is a new DW_KEY_1 under the Column in Query field.
4 Click Finish.

Note the DW_KEY_1 in the Query Editor Schema Out pane:

5 Right-click lookup_ext, and select Modify Function Call.
6 In the Output Parameters section, under Column in Query, edit this field so that it reads DW_KEY.


Your Schema Out pane should look like this:

Connecting the data flows for LoadMySalesDW_Job

Now that you have built all the data flows for loading MySales data warehouse, you may want to put all the data flows in LoadMySalesDW_Job and execute the data flows sequentially:

• The target row count for FACT_SALES_ORDER (this is one of the last target tables in your job) should still be 2155.

• The target row count for BAD_ORDERS (this is one of the last target tables in your job) should still be one.

A solution file, Building_MySales_Warehouse_solution.atl, is also included in your resource CD. Import this .atl file into the Designer to view the actual job, data flow and transform definitions.

To import this .atl solution file, right-click in your object library, select Repository, and click Import from File. Browse to the resource CD to locate the file and open it.
Note: Do not execute the solution job, as this may overwrite the results in your target table. Use the .atl solution file as a reference to view your data flow design and mapping logic.


Adding an aggregate table for sales by region by country
You are going to create an aggregate table that you can later add to the warehouse you just built. This is a sales aggregate table grouped by region and by country.
1 Create an AGG_Sales_Region_DF data flow.
2 Use FACT_SALES_ORDER and DIM_CUSTOMER as your source tables.
3 Map these columns to AGG_REGION_SALES:
• Region
• Country
4 Map SALES with the sum of the total order values.
Tip: Use the built-in Aggregate functions and define a group by statement (see the sketch after these steps).
5 Propose a join in the Query transform WHERE clause.
6 Validate and execute the job separately.
7 Now go to the Metadata Reporting tool and, using the Impact and Lineage Analysis reports, look at the Lineage analysis for the AGG_REGION_SALES table.
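As a reference for steps 4 and 5, the Query transform typically boils down to entries like the following sketch (the CUSTOMERID join key and the column qualifiers are assumptions based on the schemas used earlier; adjust them to match your imported tables):

SALES mapping: sum(FACT_SALES_ORDER.TOTAL_ORDER_VALUE)
WHERE clause:  FACT_SALES_ORDER.CUSTOMERID = DIM_CUSTOMER.CUSTOMERID
Group By:      DIM_CUSTOMER.REGION, DIM_CUSTOMER.COUNTRY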

Solution
• The target row count should be 24.

A solution file, AGG_Sales_Region_DF_solution.atl, is also included on your resource CD. Import this .atl file into the Designer to view the data flow and transform definitions.
Note: Do not execute the solution job, as this may overwrite the results in your target table. Use the .atl solution file as a reference to view your data flow design and mapping logic.


Appendix A
Answer Key

This appendix contains the answers to the review questions in the lessons.


Lesson 1

Quiz: Data Warehousing Concepts
1 Explain the difference between a recursive hierarchy and a vertically flattened hierarchy.
Answer:
Recursive: each record points back to its parent. This allows the structure of the hierarchy to flex without changing the table schema.
Vertically flattened: all the data you need about the relationship between any two related nodes in the hierarchy is provided in a single row, with no lookups required.

2 What is dimensional modelling?
Answer:
It is a logical data design technique that aims to represent data in an intuitive standard framework that allows for high-performance access.

3 What is an SCD? List some of the advantages of Type 2 SCD. When might Type 1 be the best solution?
Answer:
Slowly changing dimension.
Type 2 advantages: adds a record, so you can accommodate multiple changes, with unlimited history preservation.
Type 1 is the best solution for a simple change, such as a name change, where there is no need to preserve the history.


Lesson 2

Quiz: Understanding Data Integrator
1 List two benefits of Data Integrator.
Answer:
Single infrastructure for batch and real-time data movement, intelligent caching, and prepackaged data solutions.

2 Which of these objects is Single-Use?
• Project
• Job
• Work flow
• Data flow
Answer:
Project

3 What is the difference between a repository and a datastore?
Answer:
The repository is a set of tables that hold system objects, source and target metadata, and transformation rules. A datastore is an actual connection to a database that holds data.

4 Place these objects in order by their hierarchy: Data Flows, Job, Project, and Work Flows.
Answer:
Projects, Jobs, Work Flows, Data Flows.


Lesson 3

Quiz: Defining Source and Target Metadata
1 What is the difference between a datastore and a database?
Answer: A datastore is a connection to a database.

2 List five kinds of metadata.
Answer: Table name, attributes, indexes, column names, descriptions, data types, and primary key columns.

3 What are the two methods in which metadata can be manipulated in Data Integrator objects? What does each of these do?
Answer: You can use an object’s options and properties settings to manipulate Data Integrator objects. Options are actions that can be performed on an object; properties are settings within the object.

4 Which of the following is NOT a datastore type?
• Database
• File Format
• Application
• Adapter
Answer: File Format


Lesson 4

Quiz: Creating a Batch Job
1 Does a data flow have to be part of a job?
Answer:
Yes. A data flow needs to be contained within a job so that it can be executed.

2 How do you add a new template table?
Answer:
Drag the template table from the Datastores tab in the Local object library onto your data flow.

3 Name some of the objects contained within a project.
Answer:
Examples of objects are: jobs, work flows, and data flows.

4 What is a conditional? Would it appear in a data flow, or in a work flow?
Answer:
A conditional is a single-use object used to implement if/then/else logic in a work flow.

5 Explain the concept of a try/catch block. How do they fit into a Data Integrator work flow?
Answer:
A try/catch block is a combination of one try object and one or more catch objects that allow you to specify alternative work flows if errors occur during job execution.


Lesson 5

Quiz: Validating, Tracing, and Debugging Batch Jobs
1 What must be running in order to execute a job immediately?
Answer:
The Designer and the Job Server.

2 List some reasons why a job might fail to execute.
Answer:
Incorrect syntax, the Job Server not running, or the port numbers for the Designer and Job Server not matching.

3 List some ways of ensuring that the job executed correctly.
Answer:
Examine the log files, or check the data with an RDBMS tool.

4 Explain the View Data option. Is this data permanently stored by Data Integrator?
Answer:
View data allows you to look at the data for a source or target file. The data under the view data feature gets deleted after each job execution session.

5 List and explain the conditions that determine how many rows of data will be visible in the View data pane.
Answer:
Size of query, sample size setting, and filtering.


Lesson 6

Quiz: Using Built-in Transforms and Nested Data
1 What is the Case transform used for?
Answer:
The Case transform provides case logic based on row values. You use the Case transform to simplify branch logic in data flows by consolidating case or decision-making logic into one transform.

2 Name the transform that you would use to combine incoming data sets to produce a single output data set with the same schema as the input data sets.
Answer:
The Merge transform.

3 A validation rule consists of a condition and an action on failure. When can you use the action on failure options in the validation rule?
Answer:
You can use the action on failure option only if:
• The column value failed the validation rule
• The Send to Pass or Send to Both options are selected

4 Which transform do you use to unnest a nested data source for loading into a target table?
Answer:
The Query transform.

5 When would you use the XML_Pipeline transform?
Answer:
Use the XML_Pipeline transform when you want to process large XML files one instance of a specified repeatable structure at a time.


Lesson 7

Quiz: Using Built-in Functions
1 Describe the differences between a function and a transform.
Answer:
Functions operate on single values, such as values in specific columns in a data set.
Transforms operate on data sets, creating, updating, and deleting rows of data.

2 Why are functions used in expressions?
Answer:
Functions can be used in expressions to map return values as new output columns. Adding output columns allows columns that are not in an input data set to be specified in an output data set.

3 What does a lookup function do?
Answer:
All lookup functions return one row for each row in the source. One difference between them is how they choose which of several matching rows to return.

4 What value would the Lookup_ext function return if multiple matching records were found on the translate table?
• Depends on Return Policy (Min or Max)
• An arbitrary matching row
• Closest value less than or equal to value from sequence column
• #MULTIVALUE error for records with multiple matches
Answer:
Depends on Return Policy (Min or Max)


Lesson 8

Quiz: Using Data Integrator Scripting Language and Variables
1 Explain the differences between a variable and a parameter.

Answer: Parameters are expressions that are passed to a work flow, data flow, or custom function when it is called in a job. A variable is a symbolic placeholder for values.

2 When would you use a global variable instead of a local variable?
Answer:
• When the variable needs to be used multiple times within a job
• When you want to reduce the development time required for passing values between job components
• When you need to create a dependency between a job-level global variable name and job components

3 What is the recommended naming convention for variables in Data Integrator?
Answer: Variable names must be preceded by a dollar sign ($). Local variables start with $, while global variables are commonly denoted by a $GV prefix.
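As a brief sketch of this convention (the variable names and message are hypothetical), a script at the start of a job might initialize global variables like this:

    # Hypothetical job initialization script using the $GV prefix for global variables
    $GV_LoadDate = sysdate();
    $GV_RunLabel = 'LOAD_' || to_char($GV_LoadDate, 'YYYYMMDD');
    print('Starting run ' || $GV_RunLabel);

A local variable (for example, $RowCount) would instead be declared for a specific job or work flow in the Variables and Parameters window and is visible only within that object.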

Page 530: Data Integrator DM370R2 Learner Guide GA


Lesson 9

Quiz: Capturing Changes in Data
1 What are the two most important reasons for using CDC?

Answer: Improving performance and preserving history are the most important reasons for using CDC.

2 Which method of CDC is preferred for the performance gain of extracting the fewest rows?
Answer: Source-based CDC is the preferred method for this performance gain, because it extracts only the changed rows from the source.

3 What is an initial load?
Answer: The first time you execute the batch job, Data Integrator uses the initial load to create the target tables and retrieve data from the first N rows, where N is the number of rows defined to be processed by the data flow at once. By default, the initial load runs the first time you execute the job.

4 What is a delta load?
Answer: After creating tables with an initial load, the delta work flow incrementally loads the next number of rows specified in the data flow. The delta load retrieves only data that has been changed or added since the last load iteration. When you execute your job, the delta load may run several times, loading data from the specified number of rows each time until all new data has been written to the target database.
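As an illustrative sketch only (the datastore, table, column, and variable names are hypothetical), a source-based delta data flow often filters the source on a timestamp that a preceding script reads from a status table:

    # Hypothetical script run before the delta data flow
    $GV_LastLoadTime = sql('Target_DS', 'SELECT MAX(end_time) FROM job_status');

    # Hypothetical WHERE clause in the delta data flow's Query transform
    ORDERS.LAST_UPDATE > $GV_LastLoadTime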

5 What type of slowly changing dimension is this combination of transforms used for: Table Comparison, Key Generation, and History Preserving?
Answer: Type II SCD.

Page 531: Data Integrator DM370R2 Learner Guide GA


Lesson 10

Quiz: Handling Errors and Auditing
1 List the different strategies you can use to avoid duplicate rows of data when re-loading a job.
Answer:
• Use the auto-correct load option on the target table
• Include the Table_Comparison transform in the data flow
• Design the data flow to completely replace the target table during each execution
• Include a preload SQL statement to execute before the table loads

2 True or False: You can only run a job in recovery mode after the initial run of the job has been set to run with automatic recovery enabled.
Answer: True.

3 True or False: Automatic recovery allows a failed job to be re-run starting from the point of failure. Automatic recovery only works if the job is unchanged; thus, you can use it in production, development, and test environments.
Answer: False. You can only use automatic recovery in the production environment.

4 What are the two scripts in a manual recovery work flow used for?
Answer: The first script reads the ending time in the status table that corresponds to the most recent start time; it is used to indicate whether the previous job run completed successfully. The second script updates the status table with the current timestamp to indicate successful job execution.
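A minimal sketch of the two scripts, assuming a hypothetical status table named job_status with start_time and end_time columns and a datastore named Target_DS (the embedded SQL uses an Oracle-style SYSDATE and would need adjusting for other databases):

    # Script 1: read the end time of the most recent run; a NULL result means the
    # previous run did not finish, so the recovery branch of the work flow should execute
    $GV_PrevEndTime = sql('Target_DS', 'SELECT end_time FROM job_status WHERE start_time = (SELECT MAX(start_time) FROM job_status)');

    # Script 2: after a successful load, stamp the current time into the status table
    sql('Target_DS', 'UPDATE job_status SET end_time = SYSDATE WHERE start_time = (SELECT MAX(start_time) FROM job_status)');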

5 What is data flow auditing?
Answer: Data flow auditing provides a way to ensure that a data flow loads correct data into the warehouse.

6 What must you define in order to audit a data flow?
Answer: You must define audit points and audit rules when you want to audit a data flow.
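For instance (the object names are hypothetical), after placing count audit points on a source and a target, an audit rule is simply a Boolean expression over the generated audit labels, along the lines of:

    # Hypothetical audit rule: the row count read from the source must equal the row count loaded
    $Count_ODS_CUSTOMER = $Count_CUSTOMER_DIM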

• True or False: The auditing feature is disabled when you run a job with the debugger.
Answer: True.

7 Which setting has the highest precedence in recovery mode?
• Recover as a Unit
• Enable Recovery
• Execute Only Once
• Recover from Last Execution
Answer: Execute Only Once.

Page 533: Data Integrator DM370R2 Learner Guide GA


Lesson 11

Quiz: Supporting a Multi-user Environment
1 True or False: Data Integrator provides feedback through trace, error, and statistics logs during both parts of the design phase.
Answer: False. This is done in the testing phase of the development process.

2 What are dependent objects?
Answer: Dependents are objects used by another object, for example, data flows that are called from within a work flow.

3 True or False: You will run across these terms when working in a Data Integrator multi-user team environment:
• Highest level object
• Object dependents
• Object version
Answer: True.

4 Which repository do you use to create and change objects for an application?
Answer: The local repository.

5 What must you do if you want to make changes to an object that resides in the Central object library?
Answer: You must check out the object from the central repository first and then make the changes.

6 True or False: Logging in to a central repository is not recommended.
Answer: True.

Page 534: Data Integrator DM370R2 Learner Guide GA


Lesson 12

Quiz: Migrating Projects
1 Name the two migration methods available in Data Integrator. What would you need to consider when choosing each method?
Answer:
• Multi-user migration via central repositories: This method provides source control for multi-user development and saves old versions of jobs and dependent objects. Multi-user migration works best in larger projects where two or more developers, or multiple teams, are working on interdependent parts of Data Integrator applications throughout all phases of development.
• Single-user export and import migration: You use a wizard to help you map objects between source and target repositories. Using export and import, you can point to information resources to be remapped between repository environments. No central repository is used with this method. It works best with small to medium-sized projects where a single developer or a small number of developers work on somewhat independent applications through all phases of development.

2 List the tasks that are associated with a multi-user migration environment.
Answer:
• Add projects to the central repository
• Get the latest version of a project from the central repository
• Update a project in the central repository
• Copy contents between central repositories

3 True or False: The export process allows you to change environment-specific information defined in datastores and file formats to match the new environment.
Answer: True.

4 True or False: You can only export objects to a file or repository.
Answer: False. You can export objects to a file or repository, but you can also export a whole repository to a file.

5 What is the advantage of using datastore configurations during migration?
Answer: Multiple datastore configurations enable you to create multiple connections within each datastore so that you can easily move jobs from development to test and production servers.

6 True or False: Logging in to a central repository is not recommended.
Answer: True.

Page 535: Data Integrator DM370R2 Learner Guide GA


Lesson 13

Quiz: Using the Administrator
1 What are the functions of the Designer and the Administrator?

Answer: The Designer supports batch job administration in the development and testing phases of a project. The Administrator provides central administration for multiple repositories and Job Servers in all phases of a project.

2 Which type of user role provides access to the Status pages (to monitor, start, and stop processes) and the Status Interval and Log Retention Period pages?
Answer: The Monitor role provides access to the Status pages (to monitor, start, and stop processes) and the Status Interval and Log Retention Period pages.

3 What are the two rules for creating server groups?
Answer:
• All the Job Servers in an individual server group must be associated with the same repository.
• Each computer can only contribute one Job Server to a server group.

Page 536: Data Integrator DM370R2 Learner Guide GA


Lesson 14

Quiz: Profiling Data
1 Which profiler task enables you to compare non-matching values between two data sources from two different systems?
Answer: Submit a relationship profiler task.

2 True or False: You can associate the Data Profiler repository with the same Job Server as the Designer.
Answer: True.

3 List some of the tasks that you can accomplish when you submit a column profiler task using the Data Profiler.
Answer:
• Obtain information on basic statistics, frequencies, ranges, and outliers
• Identify variations of the same data content
• Discover and validate data patterns and formats, such as analyzing a numeric range

4 List some of the tasks that you can accomplish when you submit a relationship profiler task using the Data Profiler.
Answer:
• Identify and validate redundant data and relationships across data sources
• Identify duplicate name and address information and non-name and address information
• Identify missing data, nulls, and blanks in the source system

5 Which Data Integrator tool do you use to view the status of all profiling tasks, and to cancel or delete profiling tasks along with their generated profile statistics?
Answer: The Administrator.

Page 537: Data Integrator DM370R2 Learner Guide GA


Lesson 15

Quiz: Managing Metadata
1 Give some examples of types of data that metadata can describe.
Answer:
• Technical metadata, for example, connection information and detailed schemas
• Business metadata, for example, business names
• Warehouse data elements, for example, sources, transformations, and targets
• Warehouse processing elements, for example, data used for scheduling, status reporting, and history recording

2 What kind of information can you find in the operational dashboard category of the Metadata Reports tool?
Answer: Operational dashboard reports provide graphical representations of Data Integrator job execution statistics.

3 What are some supported formats for metadata exchange?
Answer:
• AllFusion ERwin Data Modeler (ERwin)
• CWM (Common Warehouse Metamodel)
• MIMB (Meta Integration Model Bridge)
• BusinessObjects Universal Metadata Bridge

4 What kind of analysis are you doing when you use the Impact & Lineage Analysis metadata reports category to view source column or table information?
Answer: You perform impact analysis when you look at the end-to-end impact of the selected source table or columns and the targets they affect.

5 What kind of analysis are you doing when you use the Impact & Lineage Analysis metadata reports category to view target table information?
Answer: You perform lineage analysis when you look at where the information in a target table comes from, or when you see what makes up a specific column mapping for a target table.

6 What is the difference between profiling data and managing metadata?
Answer: Profiling data using the Data Profiler allows you to see the actual data so that you can evaluate it before designing your data transformations. Metadata reporting allows you to determine how the data within your Data Integrator datastores is defined.


Page 553: Data Integrator DM370R2 Learner Guide GA

Training Evaluation
We continuously strive to provide high quality trainers and training guides. To do this, we rely on your feedback. We appreciate you taking the time to complete this evaluation.

Name: _________________________________________ Position: ___________________________________

Company: ___________________________________________________________________________________

Email: ________________________________ Phone: _____________________________________

Course Information

Date: ______________________________________ Course: ____________________________________

Instructor’s Name:____________________________________________________________________________

Location of Training: _______________________________________________________________________

How would you rate the following? (Excellent / Good / Average / Poor)

Did the course content meet your needs?

Were the training materials clear and easy to understand?

Was the Learner’s Guide complete and accurate?

Were there enough hands-on activities?

Did the instructor have strong presentation skills?

Did the instructor have a technical understanding of the product?

Was the instructor attentive to the class level, adjusting the lessons accordingly?

Was the training facility properly equipped?

How would you rate the course overall?

Page 554: Data Integrator DM370R2 Learner Guide GA

What did you enjoy most about the course?

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

What did you enjoy least about the course?

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

Comment on this training guide’s completeness, accuracy, organization, usability, and readability.

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

How did you find out about the training class?

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

What changes, if any, would you like to recommend regarding this session?

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________

________________________________________________________________________________________