5 IBM Presentation 268018 7
-
Upload
vincenzo-presutto -
Category
Documents
-
view
214 -
download
0
Transcript of 5 IBM Presentation 268018 7
-
7/28/2019 5 IBM Presentation 268018 7
1/43
2008 IBM Corporation1
Three courses of DataStage, with a sideorder of TeradataStewart HannaProduct Manager
-
7/28/2019 5 IBM Presentation 268018 7
2/43
2
Information Management Software
Agenda
Platform Overview
Appetizer - Productivity First Course Extensibility
Second Course Scalability
Desert Teradata
-
7/28/2019 5 IBM Presentation 268018 7
3/43
3
Information Management Software
Agenda
Platform Overview
Appetizer - Productivity First Course Extensibility
Second Course Scalability
Desert Teradata
-
7/28/2019 5 IBM Presentation 268018 7
4/434
Information Management Software
Business
Value
Maturity of
Information Use
Accelerate to the Next Level
Unlocking the Business Value of Information for Competitive Advantage
-
7/28/2019 5 IBM Presentation 268018 7
5/435
Information Management Software
An Industry Unique Information Platform
Simplify the delivery of Trusted Information
Accelerate client value
Promote collaboration
Mitigate risk
Modular but Integrated
Scalable Project to Enterprise
The IBM InfoSphere Vision
-
7/28/2019 5 IBM Presentation 268018 7
6/436
Information Management Software
IBM InfoSphere Information Server
Delivering information you can trust
-
7/28/2019 5 IBM Presentation 268018 7
7/437
Information Management Software
InfoSphere DataStage
Provides codeless visual design ofdata flows with hundreds of buil t-intransformation functions
Optimized reuse of integrationobjects
Supports batch & real-timeoperations
Produces reusable components thatcan be shared across projects
Complete ETL functionality withmetadata-driven productivity
Supports team-based developmentand collaboration
Provides integration from across thebroadest range of sources
Transform
Transform and aggregate any volume
of information in batch or real timethrough visually designed logic
Hundreds of Built-in
Transformation Functions
ArchitectsDevelopers
InfoSphere DataStage
Deliver
-
7/28/2019 5 IBM Presentation 268018 7
8/438
Information Management Software
Agenda
Platform Overview
Appetizer - Productivity First Course Extensibility
Second Course Scalability
Desert Teradata
-
7/28/2019 5 IBM Presentation 268018 7
9/43
9
Information Management Software
DataStage Designer
Completedevelopment
environment Graphical, drag and
drop top down design
metaphor
Develop sequentially,deploy in parallel
Component-basedarchitecture
Reuse capabilities
-
7/28/2019 5 IBM Presentation 268018 7
10/43
10
Information Management Software
Transformation Components
Over 60 pre-built componentsavailable including
Files
Database
Lookup
Sort, Aggregation,
Transformer
Pivot, CDC
J oin, Merge
Filter, Funnel, Switch, Modify
Remove Duplicates
Restructure stages
Sample ofStages
Avai lable
-
7/28/2019 5 IBM Presentation 268018 7
11/43
11
Information Management Software
Some Popular Stages
Usual ETL Sources & Targets:- RDBMS, Sequential File, Data Set
Combining Data:- Lookup, Joins, Merge
- Aggregator
Transform Data:- Transformer, Remove Duplicates
Ancillary:- Row Generator, Peek, Sort
-
7/28/2019 5 IBM Presentation 268018 7
12/43
12
Information Management Software
Deployment
Easy and integrated jobmovement from one environmentto the other
A deployment package can becreated with any first class objectsfrom the repository. New objects
can be added to an existingpackage.
The description of this package isstored in the metadata repository.
All associated objects can beadded in like files outside of DSand QS like scripts.
Audit and Security Control
-
7/28/2019 5 IBM Presentation 268018 7
13/43
13
Information Management Software
Role-Based Tools with Integrated Metadata
Simpli fy Integration Increase trust andconfidence in information
Increase compliance tostandards
Facil itate changemanagement & reuseDesign Operational
DevelopersSubject MatterExperts
DataAnalysts
BusinessUsers
Architects DBAs
Unified Metadata Management
-
7/28/2019 5 IBM Presentation 268018 7
14/43
14
Information Management Software
Business analysts and ITcollaborate in context tocreate project specification
Leverages sourceanalysis, target models,and metadata to facilitatemapping process
Auto-generation ofdata transformationjobs & reports
Generate historicaldocumentation for tracking
Supports data governance
Flexible Reporting
Specification
Auto-generatesDataStage jobs
InfoSphere FastTrackTo reduce Costs of Integration Projects through Automation
-
7/28/2019 5 IBM Presentation 268018 7
15/43
15
Information Management Software
Agenda Platform Overview
Appetizer - Productivity First Course Extensibility
Second Course Scalability
Desert Teradata
-
7/28/2019 5 IBM Presentation 268018 7
16/43
16
Information Management Software
Extending DataStage - Defining Your Own Stage Types
Define your own stage type to beintegrated into data flow
Stage Types
Wrapped Specify a OS command or
script
Existing Routines, Logic, Apps
BuildOp Wizard / Macro driven
Development
Custom
API Development
Available to all jobs in the project
All meta data is captured
-
7/28/2019 5 IBM Presentation 268018 7
17/43
17
Information Management Software
Building WrappedStages
In a nutshell:
You can wrap a legacy executable:
binary, Unix command,
shell script
and turn it into a bona fide DataStage stage capable, amongother things, of parallel execution,
as long as the legacy executable is
amenable to data-partition parallelism
no dependencies between rows
pipe-safe
can read rows sequentially
no random access to data, e.g., use of fseek()
-
7/28/2019 5 IBM Presentation 268018 7
18/43
18
Information Management Software
Building BuildOp Stages In a nutshell:
The user performs the fun, glamorous tasks:encapsulate business logic in a custom operator
The DataStage wizard called buildop automaticallyperforms the unglamorous, tedious, error-prone tasks:
invoke needed header files, build the necessaryplumbing for a correct and efficient parallel
execution.
-
7/28/2019 5 IBM Presentation 268018 7
19/43
19
Information Management Software
Layout interfaces describe what columns the stage:
needs for its inputs
creates for its outputs
Two kinds of interfaces: dynamic and static
Dynamic: adjusts to itsinputs automatically
Custom Stages can beDynamic or Static
IBM DataStage suppliedStages are dynamic
Static: expects input to
contain columns withspecific names and types
BuildOp are only Static
Major difference between BuildOp and Custom Stages
I f ti M t S ft
-
7/28/2019 5 IBM Presentation 268018 7
20/43
20
Information Management Software
Agenda Platform Overview
Appetizer - Productivity
First Course Extensibility
Second Course Scalability
Desert Teradata
I f ti M t S ft
-
7/28/2019 5 IBM Presentation 268018 7
21/43
21
Information Management Software
Scalability is important everywhere
It take me 4 1/2 hours to wash, dry and fold 3 loads oflaundry ( hour for each operation)
Wash Dry Fold
Sequential Approach (4 Hours)
Wash a load, dry the load, fold it
Wash-dry-fold
Wash-dry-fold
Wash Dry FoldPipeline Approach (2 Hours)
Wash a load, when it is done, put it in the dryer, etc
Load washing while another load is drying, etc
Wash Dry FoldWash Dry Fold
Partitioned Parallelism Approach (1 Hours)
Divide the laundry into different loads (whites, darks,linens)
Work on each piece independently
3 Times faster with 3 Washing machines, continue with 3dryers and so on
Wash Dry FoldPartitio
nin
g
Repartitio
nin
g
Whites, Darks, Linens
WorkclothesDelicates, Lingerie, Outside
LinensShirts on Hangers, pants
folded
Partitio
nin
g
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
22/43
Information Management Software
SourceData
Processor 1
Processor 2
Processor 3
Processor 4
A-F
G- M
N-T
U-Z
Break up big data into partitionsRun one partition on each processor4X times faster on 4 processors; 100X faster on 100 processors
Partitioning is specified per stage meaning partitioning can changebetween stages
Transform
Transform
Transform
Transform
Data Partitioning
Typeso
f
Partitioning
options
available
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
23/43
Information Management Software
Data Pipelining
Source Target
Archived Data
1,000 recordsfirst chunk
1,000 recordssecond chunk
1,000 recordsthird chunk
1,000 recordsfourth chunk
1,000 recordsfifth chunk
1,000 recordssixth chunk
Records 9,001to 100,000,000
1,000 recordsseventh chunk
1,000 recordseighth chunk
Load
Eliminate the write to disk and the read from diskbetween processes
Start a downstream process while an upstreamprocess is still running.
This eliminates intermediate staging to disk, which iscritical for big data.
This also keeps the processors busy.
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
24/43
Information Management Software
TargetSource
Partiti
onin
g
Transform
Parallel Dataflow
Parallel Processing achieved in a data flow
Still limiting
Partitioning remains constant throughout flow
Not realistic for any realjobs
For example, what if transformations are based on customer idand enrichment is a house holding task (i.e., based on post code)
Aggregate Load
SourceData
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
25/43
Information Management Software
TargetSource
Pipelining
SourceData
Partitio
nin
g
A-F
G- MN-T
U-Z
Customer last name Customer Postcode Credit card number
LoadRep
artitio
ning
Rep
artitio
nin
g
Transform Aggregate
Parallel Data Flow with Auto Repartitioning
Record repartitioning occurs automatically
No need to repartition data as you
add processors
change hardware architecture
Broad range of partitioning methods
Entire, hash, modulus, random, round robin, same, DB2 range
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
26/43
26
Information Management Software
Partitio
nin
g
DataWarehouse
SourceData A-F
G- MN-T
U-Z
Repartiti
oning
Repartitio
nin
g
Sequential
4-way parallel
128-way parallel
32 4-way nodes
....
....
....
Appl icat ion Execution: Sequential or Parallel
Partitio
ning
Target
DataWarehouse
Source
SourceData A-F
G- MN-T
U-Z
Customer last name Custzip code Credit card number
Repartiti
oning
Repartitio
ning
Target
SourceData
DataWarehouse
Source
Target
SourceData
DataWarehouse
Source
Partitio
nin
g
DataWarehouse
SourceData A-F
G- M
N-TU-Z
Repartition
ing
Repartition
ing
Application Assembly: One Dataf low Graph Created With the DataStage GUI
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
27/43
27
Information Management Software
Data Set stage: allows you to read data from or write data to a data set.Parallel Extender data sets hide the complexities of handling and storing largecollections of records in parallel across the disks of a parallel computer.
File Set stage: allows you to read data from or write data to afile set. It only executes in parallel mode.
Sequential File stage: reads data from or writes data to one or moreflat files. It usually executes in parallel, but can be configured to executesequentially.
External Source stage: allows you to read data that isoutput from one or more source programs.
External Target stage: allows you to write data to one or moretarget programs.
Robust mechanisms for handling big data
-
7/28/2019 5 IBM Presentation 268018 7
28/43
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
29/43
g
DataStage SAS Stages
Parallel SAS Data Set stage:
Allows you to read data from or write data to a parallel SAS data set inconjunction with a SAS stage.
A parallel SAS Data Set is a set of one or more sequential SAS datasets, with a header file specifying the names and locations of all of the
component files.
SAS stage:
Executes part or all of a SAS application in parallel by processing
parallel streams of data with parallel instances of SAS DATA andPROC steps.
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
30/43
30
The Configuration File{
node "n1" {f ast name "s1"pool " " "n1" "s1" "app2" "sort "r esour ce di sk "/ or ch/ n1/ d1" {}r esour ce di sk "/ or ch/ n1/ d2" {"bi gdat a"}r esour ce scrat chdi sk "/ t emp" {"sor t "}
}node "n2" {
f ast name "s2"pool " " "n2" "s2" "app1"r esour ce di sk "/ or ch/ n2/ d1" {}r esour ce di sk "/ or ch/ n2/ d2" {"bi gdat a"}r esour ce scr at chdi sk " / t emp" {}
}node "n3" {
f ast name "s3"pool " " "n3" "s3" "app1"r esour ce di sk "/ or ch/ n3/ d1" {}r esour ce scr at chdi sk " / t emp" {}
}node "n4" {
f ast name "s4"pool " " "n4" "s4" "app1"r esour ce di sk "/ or ch/ n4/ d1" {}r esour ce scr at chdi sk " / t emp" {}
}
1
43
2
Two key aspects:
1. # nodes declared
2. defining subset of resources"pool" for execution under"constraints," i.e., using asubset of resources
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
31/43
31
Dynamic GRID Capabilities GRID Tab to help manage execution across a grid
Automatically reconfigures parallelism to fit GRID
resources
Managed through an external grid resource manager
Available for Red Hat and Tivoli Work Scheduler Loadleveler
Locks resources at execution t ime to ensure SLAs
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
32/43
32
J ob Monitoring & Logging Job Performance Analysis
Detail job monitoring
information available duringand after job execution
Start and elapsed times
Record counts per link
% CPU used by each process
Data skew across partitions
Also available from
command line dsj ob r epor t
[ ]
type =BASIC, DETAIL, XML
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
33/43
33
Provides automatic optimizationof data flows mappingtransformation logic to SQL
Leverages investments in DBMShardware by executing data
integration tasks with and withinthe DBMS
Optimizes job run-time by
allowing the developer to controlwhere the job or various parts ofthe job will execute.
Transform and aggregate any volume
of information in batch or real time
through visually designed logic
ArchitectsDevelopers
Optimizing run time through intelligent
use of DBMS hardware
InfoSphere DataStage Balanced OptimizationData Transformation
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
34/43
34
Using Balanced Optimization
originalDataStagejob
c designjob
DataStageDesigner
job
results
d compile& run e verify
rewrittenoptimized
job
f optimizejob
Balanced Optimization
g compile& run
h choose different opt ionsand reoptimize
i manually review/edit optimized job
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
35/43
35
Leveraging best-of-breed systems
Optimization is not constrained to a singleimplementation style such as ETL or ELT
InfoSphere DataStage BalancedOptimization fully harnesses availablecapacity and computing power in Teradata
and DataStage
Delivering unlimited scalability and
performance through parallel executioneverywhere, all the time
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
36/43
36
Agenda Platform Overview
Appetizer - Productivity
First Course Extensibility
Second Course Scalability
Desert Teradata
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
37/43
37
InfoSphere Rich Connectivity
Rich
Connectivity
Shared, easy to
use connectivity
infrastructure
Event-driven,
real-time, and
batch
connectivity
High volume,
parallel connectivity
to databases and
file systems
Best-of-breed,
metadata-driven
connectivity to
enterprise applications
Frictionless Connectivity
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
38/43
38
Teradata Connectivity
7+ highly optimized interfaces forTeradata leveraging Teradata Toolsand Utilities: FastLoad, FastExport,
TPUMP, MultiLoad, TeradataParallel Transport, CLI and ODBC
You use the best interfacesspecifically designed for your
integration requirements:
Parallel/bulk extracts/loads withvarious/optimized data partitioning
optionsTable maintenance
Real-time/transactional trickle
feeds without table locking
FastLoad
FastExport
MulitLoad
TPUMP
TPT
CLI
ODBC
TeradataEDW
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
39/43
39
Teradata Connectivty RequestedSessions determines the total number of distr ibuted
connections to the Teradata source or target
When not specified, it equals the number of Teradata VPROCs
(AMPs) Can set between 1 and number of VPROCs
SessionsPerPlayerdetermines the number of connections each player willhave to Teradata. Indirectly, it also determines the number of players
(degree of parallelism). Default is 2 sessions / player
The number selected should be such thatSessionsPerPlayer * number of nodes * number of players pernode = RequestedSessions
Setting the value ofSessionsPerPlayertoo low on a large systemcan result in so many players that the job fails due to insufficientresources. In that case, the value for -SessionsPerPlayer shouldbe increased.
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
40/43
40
Teradata Server
MPP with 4 TPA nodes
4 AMPs per TPA node
DataStage Server
Configuration File Sessions Per Player Total Sessions
16 nodes 1 16
8 nodes 1 8
8 nodes 2 16
4 nodes 4 16
Example Settings
Teradata SessionsPerPlayer Example
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
41/43
41
Customer Example Massive Throughput
CSA Production Linux Grid
Teradata
TeradataNode
TeradataNode
TeradataNode
TeradataNode
TeradataNode
TeradataNode
TeradataNode
TeradataNode
TeradataNode
TeradataNode
Head Node Compute A Compute B
TeradataNode
TeradataNode
Compute C Compute D Compute E
24 Port Switch24 Port Switch
32 GbInterconnect
Legend
Teradata
Linux Grid
Stacked Switches
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
42/43
42
The IBM InfoSphere Information Server Advantage
A Complete Information Infrastructure
A comprehensive, unified foundation for enterprise informationarchitectures, scalable to any volume and processing requirement
Auditable data quality as a foundation for trusted informationacross the enterprise
Metadata-driven integration, providing breakthrough product ivityand flexibility for integrating and enriching information
Consistent, reusable information services along withapplication services and process services, an enterprise essential
Accelerated time to value with proven, industry-aligned solut ionsand expertise
Broadest and deepest connectivity to information across diversesources: structured, unstructured, mainframe, and applications
Information Management Software
-
7/28/2019 5 IBM Presentation 268018 7
43/43
43