Database and Data Mining Group (DBMG), Politecnico di Torino
SQL Server 2005 Integration Services
Integration Services project
• An Integration Services project allows managing all ETL processes
• It is based on Business Intelligence projects of type “Integration Services”
• Open Visual Studio and create a new Business Intelligence project of type “Integration Services Project”
SQL Server 2005 Integration Services - Riccardo Dutto, Paolo Garza, Politecnico di Torino
Control Flow
• A Control Flow defines the flow of tasks by means of the following tools
  – Data Flow Task
    • Basic block that executes a data transfer from a source to a destination using transformations
  – Execute SQL Task
    • Execution of SQL statements
  – Execute Package Task
    • Execution of an external package
Creating a Control Flow
• To create a new Control Flow
  – drag and drop the objects into the middle of the main window
  – connect the objects using the connection arrows
• Connection arrows define the order of precedence among tasks
• Usually there is no data exchange among Control Flow tasks
  – You can use variables
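The precedence rule can be illustrated with a small Python sketch. It is only an analogy under stated assumptions: the task names are hypothetical, the standard-library graphlib module (Python 3.9+) stands in for the connection arrows of the designer, and a plain dictionary plays the role of package variables. It does not reproduce how Integration Services schedules tasks internally.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each arrow (A, B) means "B may start only after A".
arrows = [
    ("Create tables", "Load dimensions"),
    ("Load dimensions", "Load facts"),
    ("Load facts", "Write audit row"),
]

# TopologicalSorter expects a mapping {task: set of predecessor tasks}.
predecessors = {}
for before, after in arrows:
    predecessors.setdefault(before, set())
    predecessors.setdefault(after, set()).add(before)

# Tasks exchange no data directly; shared variables carry state between them.
variables = {"executed": []}

for task in TopologicalSorter(predecessors).static_order():
    variables["executed"].append(task)

print(variables["executed"])
```

The arrows only constrain the order. Anything one task needs to tell another goes through the shared variables.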
Control Flow - Execute SQL Task
• The Execute SQL Task allows the execution of SQL statements/scripts
• Usually used to
  – Create tables
  – Delete data
  – Insert information into auditing tables
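As a rough illustration of the kind of statements such a task typically runs, the sketch below executes them from Python. The table names are hypothetical, and an in-memory SQLite database stands in for SQL Server, so the SQL dialect may differ in detail.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server in this sketch.
conn = sqlite3.connect(":memory:")

# Statements of the kind an Execute SQL Task might run (hypothetical tables).
statements = [
    # Create tables
    "CREATE TABLE staging_sales (id INTEGER, amount REAL)",
    "CREATE TABLE audit_log (package TEXT, event TEXT)",
    # Insert information into an auditing table
    "INSERT INTO audit_log VALUES ('LoadSales', 'started')",
    # Delete data, e.g. empty a staging table before a new load
    "DELETE FROM staging_sales",
]
for sql in statements:
    conn.execute(sql)

audit_rows = conn.execute("SELECT package, event FROM audit_log").fetchall()
staging_count = conn.execute("SELECT COUNT(*) FROM staging_sales").fetchone()[0]
print(audit_rows, staging_count)
```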
[Screenshot: Execute SQL Task configuration. Callouts: connection to DB; SQL statement to execute]
Control Flow - Execute SQL Task
• Before executing a SQL statement, a connection to a DB must be defined
• Creating a new connection requires
  – Name of the server
  – Name of the database
  – Authentication method
Control Flow – Data Flow
• Data Flow blocks handle the data processing by means of objects that
  – Define sources and destinations
  – Transform and add attributes
  – Merge the content of different sources
  – …
[Screenshot: Data Flow tab. Callout: from the combo menu you can choose the Data Flow on which you wish to work, among the ones available in the Control Flow]
The Data Flow
Data sources → Data processing → Data destinations
Data Flow Sources
• SQL Server tables
• Excel spreadsheets
• Text/Flat files (CSV)
• …
Data Flow Transformations
• Data format changes
• Derived fields to be added
• Data split or merge from different sources
• Data grouping
• …
Data Flow Destinations
• SQL Server tables
• Excel spreadsheets
• Text/Flat files (CSV)
• …
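The source, transformation, and destination stages can be sketched as a small Python pipeline. This is an analogy, not SSIS code: the CSV content, column names, and table name are made up, and SQLite stands in for a SQL Server destination.

```python
import csv
import io
import sqlite3

# Source: a flat file (CSV). The content is invented for this sketch.
flat_file = io.StringIO("product,quantity,unit_price\npen,10,1.50\nbook,2,12.00\n")

def source(f):
    """Read rows from the flat file as dictionaries."""
    yield from csv.DictReader(f)

def transform(rows):
    """Change data formats and add a derived field (total)."""
    for row in rows:
        quantity = int(row["quantity"])
        unit_price = float(row["unit_price"])
        yield (row["product"], quantity, unit_price, quantity * unit_price)

# Destination: a relational table (SQLite stands in for SQL Server).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product TEXT, quantity INTEGER, unit_price REAL, total REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", transform(source(flat_file)))

loaded = conn.execute("SELECT product, total FROM sales ORDER BY product").fetchall()
print(loaded)
```

Rows stream from one stage to the next, which mirrors the physical data exchange along Data Flow links.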
Control Flow – Data Flow
• In the Data Flow, connection arrows indicate physical data exchange
• By double clicking on a link (Data Flow Path) you can
  – Read the metadata of the transferred data
  – Define data viewers to visualize the data being transferred on the link itself at runtime
    • Useful for debugging
[Screenshot: metadata for the selected link]
Data Flow Source - OLE DB source
• OLE DB source
  – Defines a connection to a SQL Server source
  – Specifies which data are used
Data Flow Source - OLE DB source
• Choose the connection to the desired DB
  – If the connection does not exist, create a new connection
• Choose the table to access
  – Gives access to the whole table content
• Alternatively, specify a SQL query
  – Gives access to selected data from one or more tables
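The difference between the two access methods can be sketched as follows. The tables and data are hypothetical, and SQLite stands in for the SQL Server source, so this shows the idea rather than the OLE DB source itself.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; tables and rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Anna', 'Torino'), (2, 'Marco', 'Milano');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 2, 20.0), (12, 1, 5.0);
""")

# Access method 1: the whole table content.
whole_table = conn.execute("SELECT * FROM customer ORDER BY id").fetchall()

# Access method 2: a SQL query selecting data from more than one table.
query = """
    SELECT c.name, o.amount
    FROM customer AS c JOIN orders AS o ON o.customer_id = c.id
    WHERE c.city = 'Torino'
    ORDER BY o.id
"""
selected = conn.execute(query).fetchall()

print(whole_table)
print(selected)
```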
[Screenshot: OLE DB source configuration. Callouts: connection to DB in use; table access method (whole table or SQL query); selected columns]
Data Flow Transformations - Data Conversion
• Used to change the data format
  – Change the string format
    • E.g., Unicode to non-Unicode
  – Numerical precision conversion
  – …
• Creates a new column for each transformation
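A minimal Python sketch of the same idea: each conversion produces a new column next to the unchanged input column. The column names are hypothetical, and Latin-1 bytes are used here as an assumed example of a non-Unicode format.

```python
from decimal import Decimal, ROUND_HALF_UP

rows = [{"name": "Politecnico", "price": "12.3456"}]  # invented input row

def data_conversion(row):
    """Return the row with one new column per conversion; inputs stay unchanged."""
    converted = dict(row)
    # String format change: Unicode text to a single-byte (non-Unicode) encoding.
    converted["name_nonunicode"] = row["name"].encode("latin-1")
    # Numerical precision conversion: keep two decimal digits.
    converted["price_2dp"] = Decimal(row["price"]).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    return converted

converted_rows = [data_conversion(r) for r in rows]
print(converted_rows)
```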
[Screenshot: Data Conversion configuration. Callouts: column with input data; new columns generated by the conversion block; output column format]
Data Flow Transformations - Union All
• Two or more sources are merged to create a common output
  – Input data must have the same schema and format to allow the union
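The schema requirement can be made concrete with a short Python sketch. The function name and the sample rows are hypothetical. Rows are concatenated without removing duplicates, as in SQL's UNION ALL.

```python
def union_all(*inputs):
    """Concatenate the inputs, keeping duplicates, after checking the schema."""
    inputs = [list(rows) for rows in inputs]
    schemas = {tuple(row.keys()) for rows in inputs for row in rows}
    if len(schemas) > 1:
        raise ValueError(f"inputs have different schemas: {sorted(schemas)}")
    return [row for rows in inputs for row in rows]

# Two invented sources with the same schema.
store_a = [{"product": "pen", "qty": 3}]
store_b = [{"product": "pen", "qty": 3}, {"product": "book", "qty": 1}]

merged = union_all(store_a, store_b)
print(merged)
```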
[Screenshot: Union All configuration. Callouts: output attribute name; input attribute names]
Data Flow Transformations - Derived columns
• Creates new columns
  – For each row, the content of the new columns may be
    • a fixed value
    • the result of a function applied to other columns
• Replaces the content of a column with a new value
  – The new value may be
    • a fixed value
    • the result of a function
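The three cases (fixed value, function of other columns, replacement of an existing column) can be sketched in Python. All column names and values are hypothetical. SSIS uses its own expression language for this, which is not shown here.

```python
def derived_column(row):
    """Add derived columns and replace one existing column."""
    out = dict(row)
    # New column with a fixed value.
    out["source_system"] = "CSV"
    # New column computed as a function of other columns.
    out["total"] = row["quantity"] * row["unit_price"]
    # Replace the content of an existing column with the result of a function.
    out["product"] = row["product"].strip().upper()
    return out

rows = [{"product": " pen ", "quantity": 10, "unit_price": 1.5}]  # invented row
derived = [derived_column(r) for r in rows]
print(derived)
```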
Data Flow Transformations - Aggregate
• Aggregates the rows according to the aggregating attributes
• Returns the value of aggregate functions for each group
  – sum, average, count, …
• When there is only one data source, the same result can be obtained using the GROUP BY clause
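The equivalence with GROUP BY can be checked on a toy example. The data and table name are invented, and SQLite stands in for SQL Server. The first half mimics what an aggregating block does on a row stream, the second half runs the corresponding query.

```python
import sqlite3
from collections import defaultdict

rows = [("Torino", 10.0), ("Milano", 4.0), ("Torino", 6.0)]  # (city, amount)

# Aggregate block: group on the aggregating attribute, then compute
# sum, average, and count for each group.
groups = defaultdict(list)
for city, amount in rows:
    groups[city].append(amount)
aggregated = sorted(
    (city, sum(v), sum(v) / len(v), len(v)) for city, v in groups.items()
)

# Same result with GROUP BY when all rows come from a single source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
grouped = conn.execute(
    "SELECT city, SUM(amount), AVG(amount), COUNT(*) "
    "FROM sales GROUP BY city ORDER BY city"
).fetchall()

print(aggregated)
print(grouped)
```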
[Screenshot: Aggregate configuration. Callouts: aggregating attributes; results of aggregating functions]
Data Flow Transformations - Lookup
• Returns a given field of another “table”
  – it behaves as a join, but
  – it can be applied to data flows which come from different sources
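A lookup can be pictured as enriching each row of the flow with one field fetched from a reference table by key. In the Python sketch below the names and data are hypothetical, and returning None for unmatched keys is an assumption of this sketch, not a statement about the default behaviour of the SSIS block.

```python
# Reference "table": it may come from a different source than the main flow.
customer_city = {1: "Torino", 2: "Milano"}

orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},  # no match in the reference table
]

def lookup(rows, reference, key, new_column):
    """Add new_column to each row, taken from reference by the row's key."""
    for row in rows:
        out = dict(row)
        # Assumption of this sketch: unmatched keys yield None.
        out[new_column] = reference.get(row[key])
        yield out

enriched = list(lookup(orders, customer_city, "customer_id", "city"))
print(enriched)
```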
[Screenshot: Lookup configuration. Callouts: choose a connection and a table on which to look up; choose the join condition and the field to return as result]
Data Flow Destination - OLE DB
• Defines a connection to a SQL Server destination
• Specifies which data are used
• Last block of all Data Flows
[Screenshot: OLE DB destination configuration. Callouts: specify the connection to the output DB; define the column mapping]
Control Flow - Execute Package Task
• Executes a child package inside a given package
• Variable values can be passed from parent to child packages
[Screenshot: Execute Package Task configuration. Callouts: location of the saved package; information on the package to be executed]
Variables
• System variables
  – Predefined variables with system information
    • PackageName
    • ExecStartDT
    • ...
• User variables
  – User defined
  – Used to manage
    • Local/relative paths
    • Data counters, …
[Screenshot: Variables window. Callouts: user variables; system variables]
Data Flow Transformations - Row count
• Counts the number of rows flowing through a link
  – the block must be put on the link to analyze
• The result is stored in a user variable
  – Useful to debug and log operations
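The behaviour of a counting block placed on a link can be sketched with a Python generator: rows pass through unchanged and the total is written to a variable when the flow ends. The variable name RowsRead and the sample rows are hypothetical.

```python
variables = {"RowsRead": 0}  # hypothetical user variable

def row_count(rows, variable_name):
    """Pass rows through unchanged while counting them."""
    count = 0
    for row in rows:
        count += 1
        yield row
    # The result is stored once all rows have flowed through the link.
    variables[variable_name] = count

source_rows = [("pen", 10), ("book", 2), ("bag", 1)]
destination = list(row_count(source_rows, "RowsRead"))
print(variables["RowsRead"])
```

Because the count is only available after the last row, it is typically read by a later step, for example one that writes an audit record.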
[Screenshot: Row Count configuration. Callout: user variable name where the result of the count is stored]
Incremental data loading
• Dimensions
  – identify additions and changes
  – SQL Server 2005 has a component dedicated to changing dimensions
• Fact table
  – identify the new rows
    • usually user variables are used to identify the temporal period of interest
    • rows in source tables must have the insert/update timestamp
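For the fact table, the combination of a timestamp column and period variables can be sketched as follows. Table names, variable names, and dates are invented, and SQLite stands in for SQL Server.

```python
import sqlite3

# Assumption: SQLite stands in for SQL Server; all names and data are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_sales (id INTEGER, amount REAL, updated_at TEXT);
    CREATE TABLE fact_sales (id INTEGER, amount REAL);
    INSERT INTO source_sales VALUES
        (1, 10.0, '2007-01-10 09:00:00'),
        (2, 20.0, '2007-02-03 14:30:00'),
        (3, 30.0, '2007-02-20 08:15:00');
""")

# User variables delimiting the temporal period of interest.
variables = {
    "PeriodStart": "2007-02-01 00:00:00",
    "PeriodEnd": "2007-03-01 00:00:00",
}

# Load only the rows inserted or updated within the period.
conn.execute(
    """
    INSERT INTO fact_sales (id, amount)
    SELECT id, amount FROM source_sales
    WHERE updated_at >= :PeriodStart AND updated_at < :PeriodEnd
    """,
    variables,
)
new_ids = [r[0] for r in conn.execute("SELECT id FROM fact_sales ORDER BY id")]
print(new_ids)
```

The half-open interval (start included, end excluded) avoids loading the same row twice when consecutive periods are processed.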
Data Flow Transformations - Slowly Changing Dimension
• Automatically generates the blocks needed to handle dimension changes
• The user chooses how to handle dimension changes for each attribute of the changing dimension
  – The choice, among the available types, must be made during the conceptual design of the data warehouse
[Screenshot: blocks generated by the Slowly Changing Dimension transformation. Callouts: this block handles the rows with changed attributes; this block handles new rows; this block handles errors]
ETL process
• During the ETL process, tasks are executed according to the Control Flow sequence
  – they are executed concurrently if possible
• Executing tasks are yellow
• Successfully completed tasks are green
• Tasks with errors are red
ETL process
• On the links among the Data Flow blocks, the number of processed rows is shown
• Data Viewers allow in-depth analysis of the data flowing on links
  – useful to debug
Automatic ETL processing
• The ETL process (as a package) can be automatically executed using
  – dtexec
  – SQL Agent
Error handling
• Errors must be handled, otherwise
  – the whole package fails
  – you do not know where the error lies
• Handling errors allows you to
  – successfully load the data that do not generate errors
  – save the data that do generate errors (e.g., in a log file)
Error handling
• The “Error” preference of blocks can be set to the following values
  – Redirect row
    • The rows that generate errors are redirected to the error link (i.e., the red output line of the block)
  – Ignore failure
    • Rows that generate errors are ignored, but the process goes on with the following rows
  – Fail component (default value)
    • The component fails
    • No data are loaded
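The three settings can be compared with a small Python sketch that follows the descriptions above. The function names, the mode strings, and the sample rows are hypothetical. This is a simplified model, not the SSIS implementation.

```python
def convert(row):
    """A block operation that may fail: parse the amount as a number."""
    return {"id": row["id"], "amount": float(row["amount"])}  # may raise ValueError

def run_block(rows, on_error="fail_component"):
    """Process rows; return (regular output, error output)."""
    output, error_output = [], []
    for row in rows:
        try:
            output.append(convert(row))
        except ValueError:
            if on_error == "redirect_row":
                error_output.append(row)  # sent to the error link
            elif on_error == "ignore_failure":
                continue  # row ignored, processing goes on
            else:
                raise  # fail component: the block stops
    return output, error_output

rows = [
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": "n/a"},  # this row generates an error
    {"id": 3, "amount": "7"},
]
good, bad = run_block(rows, on_error="redirect_row")
print(good, bad)
```

With redirect_row the failing row survives on the error output, where it can be written to a log file or table for later inspection.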
[Screenshot: error handling configuration. Callout: type of error handling]
Data that cannot be stored into the destination data base are saved in a log file
Metadata - Auditing
• During the ETL, executed processes (packages) must be tracked
  – State
  – Execution time
  – Number of rows analyzed by each process
  – Number of errors
  – ...
Metadata - Auditing
• Auditing information / metadata are usually stored inside a relational table
  – Easy to update during the ETL process execution
  – Easy to analyze by means of SQL queries
Metadata - Auditing
• In the Control Flow, insert a block at the beginning of each process, saving information before execution
• Insert a block at the end of each process, saving information about execution and result
  – Number of processed rows
  – Number of errors
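The start and end auditing blocks can be sketched as two SQL statements around the process. The audit table layout, the function names, and the counters are hypothetical, and SQLite stands in for SQL Server.

```python
import sqlite3
from datetime import datetime

# Assumption: SQLite stands in for SQL Server; the table layout is invented.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE audit (
        package TEXT, state TEXT, started_at TEXT, ended_at TEXT,
        processed_rows INTEGER, errors INTEGER
    )
""")

def audit_start(package):
    """Block at the beginning of the process: record that it started."""
    cur = conn.execute(
        "INSERT INTO audit (package, state, started_at) VALUES (?, 'running', ?)",
        (package, datetime.now().isoformat()),
    )
    return cur.lastrowid

def audit_end(audit_id, processed_rows, errors):
    """Block at the end of the process: record the result and the counters."""
    state = "succeeded" if errors == 0 else "completed with errors"
    conn.execute(
        "UPDATE audit SET state = ?, ended_at = ?, processed_rows = ?, errors = ? "
        "WHERE rowid = ?",
        (state, datetime.now().isoformat(), processed_rows, errors, audit_id),
    )

audit_id = audit_start("LoadSales")
processed_rows, errors = 120, 2  # made-up counters, e.g. taken from user variables
audit_end(audit_id, processed_rows, errors)

summary = conn.execute(
    "SELECT package, state, processed_rows, errors FROM audit"
).fetchall()
print(summary)
```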
Metadata - Auditing
• User variables
  – Information on the process execution and completion, at the end of the block
    • Number of processed rows
    • Number of errors
• Text files or database
  – Rows that generated errors
SSIS Log
• Automatic process log system provided by SQL Server 2005
• Allows accessing and saving the basic process/package information
• Information can be saved in files, a SQL Server database, …
• Information that cannot be saved
  – Number of processed rows
  – Rows that generated errors