Database and data mining gr oup, Politecnico di Torino...

27
SQL Server 2005 Integration Services 1 Database and data mining group, Politecnico di Torino D B M G Integration Services project An Integration Services project allows managing all ETL processes It is based on Business Intelligence projects of type “Integration Services” Open Visual Studio and create a new Business Intelligence projects of type “I i S i P j SQL Server 2005 Integration Services - 2 Riccardo Dutto, Paolo Garza Politecnico di Torino “Integration Services ProjectDatabase and data mining group, Politecnico di Torino D B M G SQL Server 2005 Integration Services - 3 Riccardo Dutto, Paolo Garza Politecnico di Torino

Transcript of Database and data mining gr oup, Politecnico di Torino...

SQL Server 2005 Integration Services

1

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

Integration Services project

• An Integration Services project allows managing all ETL processesg g p

• It is based on Business Intelligence projects of type “Integration Services”

• Open Visual Studio and create a new Business Intelligence projects of type “I i S i P j ”

SQL Server 2005 Integration Services - 2 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

“Integration Services Project”

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

SQL Server 2005 Integration Services - 3 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

SQL Server 2005 Integration Services

2

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

Control Flow

• A Control Flow defines the flow of tasks by means of the following toolsy g– Data Flow Task

• Basic block to execute data transfer from a source to a destination using transformations

– Execute SQL Task• SQL statements execution

SQL Server 2005 Integration Services - 4 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

– Execute Package• External package execution

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

SQL Server 2005 Integration Services - 5 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

SQL Server 2005 Integration Services

3

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

Creating a Control Flow

• To create a new Control Flow– drag and drop the objects in the middle of g p j

the main window– connect the objects using the connection

arrows• Connection arrows define the order of

precedence among tasks

SQL Server 2005 Integration Services - 6 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

precedence among tasks• Usually there is no data exchange among

Control Flow tasks– You can use variables

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

SQL Server 2005 Integration Services - 7 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

SQL Server 2005 Integration Services

4

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

Control flow - Execute SQL Task

• Execute SQL Task allows the execution of SQL statements/scriptsp

• Usually used to– Create tables– Delete data– Insert information into Auditing tables

SQL Server 2005 Integration Services - 8 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

Connection to DB

SQL Server 2005 Integration Services - 9 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

SQL statement to execute

SQL Server 2005 Integration Services

5

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

Control flow - Execute SQL Task• Before executing a SQL statement, a

connection to a DB must be defined• Creating a new connection

– Name of the server– Name of the data base– Authentication method

SQL Server 2005 Integration Services - 10 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

SQL Server 2005 Integration Services - 11 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

SQL Server 2005 Integration Services

6

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• Data Flow blocks handle the data processing by means of objects that

Control Flow – Data Flow

p g y j– Define sources and destinations– Transform and add attributes– Merge the content of different sources– …

SQL Server 2005 Integration Services - 12 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

From the combo menu you can choose the Data Flow on which you wish to work among the ones

SQL Server 2005 Integration Services - 13 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

wish to work, among the ones available in the Control Flow

SQL Server 2005 Integration Services

7

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

The Data FlowData sources

Data processing

Data destinations

Data Flow Sources

• SQL Server tables

• Excel spreadsheets

• Text/Flat files (CSV)

Data Flow Transformations

• Data format changes

• Derived fields to be added

• Data split or merge from different sources

Data Flow Destinations

• SQL Server tables

• Excel spreadsheets

• Text/Flat files (CSV)

SQL Server 2005 Integration Services - 14 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

• … different sources

• Data grouping

• …

• …

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• In the Data Flow, connection arrows indicate physical data exchange

Control Flow – Data Flow

p y g• Double clicking on a link

(Data Flow Path) you can– Read the metadata of transferred data– Define data viewers to visualize data being

transferred on the link itself at r ntime

SQL Server 2005 Integration Services - 15 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

transferred on the link itself at runtime• Useful for debugging

SQL Server 2005 Integration Services

8

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

Metadata for theselected link

SQL Server 2005 Integration Services - 16 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

selected link

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• OLE DB source– Defines a connection to a SQL Server

Data Flow Source - OLE DB source

source– Specifies which data are used

SQL Server 2005 Integration Services - 17 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

SQL Server 2005 Integration Services

9

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• Choose the connection to the desired DB– If the connection does not exist, create a

Data Flow Source - OLE DB source

,new connection

• Choose the table to access to– Allows accessing to the whole table content

• Alternatively, specify a SQL query

SQL Server 2005 Integration Services - 18 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

– Allows accessing to selected data from one or more tables

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

Selected columns

Connection to DB in use

Table access method

SQL Server 2005 Integration Services - 19 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

method

• whole table

• SQL query

SQL Server 2005 Integration Services

10

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• Used to change the data formatCh h i f

Data Flow TransformationsData Conversion

– Change the string format• E.g., Unicode to non-Unicode

– Numerical precision conversion– …

• Creates a new column for each

SQL Server 2005 Integration Services - 20 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

C eates a e co u o eactransformation

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

SQL Server 2005 Integration Services - 21 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

Column with input data

New columns generated by the conversion block

Output column format

SQL Server 2005 Integration Services

11

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• Two or more sources are merged to create a common output

Data Flow TransformationsUnion all

create a common output– Input data must have the same schema and

format to allow union

SQL Server 2005 Integration Services - 22 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

SQL Server 2005 Integration Services - 23 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

Output attribute name Input attribute names

SQL Server 2005 Integration Services

12

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• Creates new columns– For each row the content of new columns

Data Flow TransformationsDerived columns

For each row the content of new columns may be

• a fixed value• the result of a function applied to other columns

• Replaces the content of a column with a new value

SQL Server 2005 Integration Services - 24 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

new value– The new value may be

• a fixed value• the result of a function

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

SQL Server 2005 Integration Services - 25 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

SQL Server 2005 Integration Services

13

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• Aggregates the rows according to the aggregating attrib tes

Data Flow TransformationsAggregate

aggregating attributes• Returns the value of aggregated

functions for each group– sum, average, count, …

• When there is only one data source the

SQL Server 2005 Integration Services - 26 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

• When there is only one data source, the same result can be obtained using the group by clause

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

Aggregating attributesattributes

SQL Server 2005 Integration Services - 27 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

Results of aggregating functions

SQL Server 2005 Integration Services

14

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• Returns a given field of another “table”i b h j i b

Data Flow TransformationsLookup

– it behaves as a join, but– can be applied to data flows which come

from different sources

SQL Server 2005 Integration Services - 28 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

Choose a connection and a table on which to lookup

Choose the join condition and the field to return as resultp

SQL Server 2005 Integration Services - 29 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

SQL Server 2005 Integration Services

15

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• Defines a connection to a SQL Server destination

Data Flow DestinationOLE DB

destination• Specifies which data are used• Last block of all Data Flows

SQL Server 2005 Integration Services - 30 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

Specify the connection to the output DB Define the column mapping

SQL Server 2005 Integration Services - 31 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

SQL Server 2005 Integration Services

16

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• Executes a child package inside a given package

Data Flow - Execute Package Task

p g• Variable values can be passed from

parent to child packages

SQL Server 2005 Integration Services - 32 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

Location of the

Information on the package to be executed

saved package

SQL Server 2005 Integration Services - 33 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

SQL Server 2005 Integration Services

17

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• System variables– Predefined variables with system information

Variables

y• PackageName• ExecStartDT• ...

• User variables– User defined

SQL Server 2005 Integration Services - 34 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

– User defined– Used to manage

• Local/relative paths• Data counters, …

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

User variablesSystem variables

SQL Server 2005 Integration Services - 35 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

SQL Server 2005 Integration Services

18

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• Counts the number of rows flowing thro gh a link

Data Flow TransformationsRow count

through a link– the block must be put on the link to analyze

• The result is stored in a user variable– Useful to debug and log operations

SQL Server 2005 Integration Services - 36 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

User variable name where the result of the count is stored

SQL Server 2005 Integration Services - 37 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

SQL Server 2005 Integration Services

19

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• Dimensions– identify additions and changes

Incremental data loading

– SQL Server 2005 has a component dedicated to changing dimensions

• Fact table– indentify the new rows

• usually user variables are used to identify the

SQL Server 2005 Integration Services - 38 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

• usually user variables are used to identify the temporal period of interest

• rows in source tables must have the insert/update timestamp

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• Automatically generates the blocks needed to handle dimension changes

Data Flow TransformationsSlowly Changing Dimension

needed to handle dimension changes• The user chooses how to handle

dimension changes for each attribute of the changing dimension– The choice, among the available types, must

SQL Server 2005 Integration Services - 39 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

g ypbe done during the conceptual design of the data warehouse

SQL Server 2005 Integration Services

20

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

SQL Server 2005 Integration Services - 40 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

This block handles the rows with changed attributes

This block handles new rows

This block handles errors

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• During the ETL process, tasks are executed according to the Control Flow

ETL process

gsequence– they are executed concurrently if possible

• Executing tasks are yellow• Successfully completed tasks are green

SQL Server 2005 Integration Services - 41 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

• Tasks with errors are red

SQL Server 2005 Integration Services

21

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• On the links among the Data Flow blocks, the number of processed rows is

ETL process

pshown

• Data Viewers allow in depth analysis of data flowing on links– useful to debug

SQL Server 2005 Integration Services - 42 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

SQL Server 2005 Integration Services - 43 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

SQL Server 2005 Integration Services

22

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• The ETL process (as a package) can be

Automatic ETL processing

automatically executed using– dtexec– SQL Agent

SQL Server 2005 Integration Services - 44 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• Errors must be handled, otherwise– the whole package fails

Error handling

– you do not know where the error lies• Handling errors allows to

– successfully load the data that do not generate errorssave the data that do generate errors (e g

SQL Server 2005 Integration Services - 45 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

– save the data that do generate errors (e.g., in a log file)

SQL Server 2005 Integration Services

23

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• The “Error” preference of blocks can be set to the following values

Redirect row

Error handling

– Redirect row• The rows that generate errors are redirected to the

error link (i.e., the red output line of the block)– Ignore failure

• Rows that generate errors are ignored, but the process goes on with the following rows

SQL Server 2005 Integration Services - 46 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

process goes on with the following rows– Fail component (default value)

• The component fails• No data are loaded

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

Type of error handling

SQL Server 2005 Integration Services - 47 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

SQL Server 2005 Integration Services

24

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

Data that cannot be stored into the destination data base are saved in a log file

SQL Server 2005 Integration Services - 48 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• During the ETL, executed processes (packages) must be tracked

Metadata - Auditing

– State– Execution time– Number of rows analyzed by each process– Number of errors

SQL Server 2005 Integration Services - 49 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

– ...

SQL Server 2005 Integration Services

25

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• Auditing information / Metadata are usually stored inside a relational table

Metadata - Auditing

– Easy to update during the ETL process execution

– Easy to analyze by means of SQL queries

SQL Server 2005 Integration Services - 50 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• In the Control Flow, insert a block at the beginning of each process saving

Metadata - Auditing

information before execution• Insert a block at the end of each process

saving information of execution and result

Number of processed rows

SQL Server 2005 Integration Services - 51 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

– Number of processed rows– Number of errors

SQL Server 2005 Integration Services

26

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• User variables– Information of the process execution and

Metadata - Auditing

completion, at the end of the block• Number of processed rows• Number of errors

• Text files or data base– Rows that generated errors

SQL Server 2005 Integration Services - 52 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

g

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

• Automatic process log system provided by SQL Server 2005

SSIS Log

• Allows to access and save the basic process/package information

• Information can be saved in files, a SQL Server data base, …I f ti th t t b d

SQL Server 2005 Integration Services - 53 Riccardo Dutto, Paolo GarzaPolitecnico di Torino

• Information that cannot be saved– Number of processed rows– Rows that generated errors

SQL Server 2005 Integration Services

27

Database and data mining group, Politecnico di Torino

DataBase and Data Mining Group of Politecnico di Torino

DBMG

Task to log

Type of events to log

g

SQL Server 2005 Integration Services - 54 Riccardo Dutto, Paolo GarzaPolitecnico di Torino