Posted on 12-Oct-2020
Table of Contents
Preface 1
Chapter 1: Getting Started with Pentaho Data Integration 7
Pentaho Data Integration and Pentaho BI Suite 7
Introducing Pentaho Data Integration 9
Using PDI in real-world scenarios 10
Loading data warehouses or data marts 11
Integrating data 12
Data cleansing 12
Migrating information 12
Exporting data 12
Integrating PDI along with other Pentaho tools 13
Installing PDI 13
Launching the PDI Graphical Designer - Spoon 14
Starting and customizing Spoon 15
Exploring the Spoon interface 19
Extending the PDI functionality through the Marketplace 21
Introducing transformations 24
The basics about transformations 24
Creating a Hello World! Transformation 25
Designing a Transformation 25
Previewing and running a Transformation 29
Installing useful related software 32
Summary 33
Chapter 2: Getting Started with Transformations 34
Designing and previewing transformations 34
Getting familiar with editing features 34
Using the mouseover assistance toolbar 35
Adding steps and creating hops 36
Working with grids 37
Designing transformations 38
Putting the editing features in practice 38
Previewing and fixing errors as they appear 44
Looking at the results in the execution results pane 48
The Logging tab 48
The Step Metrics tab 49
Running transformations in an interactive fashion 49
Understanding PDI data and metadata 54
Understanding the PDI rowset 54
Adding or modifying fields by using different PDI steps 57
Explaining the PDI data types 59
Handling errors 60
Implementing the error handling functionality 61
Customizing the error handling 64
Summary 67
Chapter 3: Creating Basic Task Flows 68
Introducing jobs 68
Learning the basics about jobs 68
Creating a Simple Job 70
Designing and running jobs 72
Revisiting the Spoon interface and the editing features 72
Designing jobs 75
Getting familiar with the job design process 75
Looking at the results in the Execution results window 77
The Logging tab 78
The Job metrics tab 78
Enriching your work by sending an email 79
Running transformations from a Job 83
Using the Transformation Job Entry 84
Understanding and changing the flow of execution 86
Changing the flow of execution based on conditions 87
Forcing a status with an Abort Job or Success entry 89
Changing the execution to be synchronous 89
Managing files 91
Creating a Job that moves some files 91
Selecting files and folders 94
Working with regular expressions 95
Summarizing the Job entries that deal with files 96
Customizing the file management 97
Knowing the basics about Kettle variables 99
Understanding the kettle.properties file 100
How and when you can use variables 101
Summary 102
Chapter 4: Reading and Writing Files 103
Reading data from files 103
Reading a simple file 103
Troubleshooting reading files 108
Learning to read all kinds of files 109
Specifying the name and location of the file 110
Reading several files at the same time 110
Reading files that are compressed or located on a remote server 112
Reading a file whose name is known at runtime 113
Describing the incoming fields 116
Reading Date fields 116
Reading Numeric fields 117
Reading only a subset of the file 118
Reading the most common kinds of sources 118
Reading text files 119
Reading spreadsheets 119
Reading XML files 120
Reading JSON files 121
Outputting data to files 121
Creating a simple file 121
Learning to create all kinds of files and write data into them 124
Providing the name and location of an output file 124
Creating a file whose name is known only at runtime 124
Creating several files whose names depend on the content of the file 126
Describing the content of the output file 128
Formatting Date fields 128
Formatting Numeric fields 129
Creating the most common kinds of files 129
Creating text files 129
Creating spreadsheets 130
Creating XML files 130
Creating JSON files 131
Working with Big Data and cloud sources 131
Reading files from an AWS S3 instance 131
Writing files to an AWS S3 instance 132
Getting data from HDFS 133
Sending data to HDFS 134
Summary 135
Chapter 5: Manipulating PDI Data and Metadata 136
Manipulating simple fields 136
Working with strings 136
Extracting parts of strings using regular expressions 138
Searching and replacing using regular expressions 140
Doing some math with Numeric fields 143
Operating with dates 143
Performing simple operations on dates 144
Subtracting dates with the Calculator step 145
Getting information relative to the current date 146
Using the Get System Info step 147
Performing other useful operations on dates 148
Getting the month names with a User Defined Java Class step 148
Modifying the metadata of streams 150
Working with complex structures 152
Working with XML 152
Introducing XML terminology 153
Getting familiar with the XPath notation 153
Parsing XML structures with PDI 155
Reading an XML file with the Get data from XML step 155
Parsing an XML structure stored in a field 158
PDI Transformation and Job files 160
Parsing JSON structures 162
Introducing JSON terminology 162
Getting familiar with the JSONPath notation 163
Parsing JSON structures with PDI 164
Reading a JSON file with the JSON input step 164
Parsing a JSON structure stored in a field 165
Summary 166
Chapter 6: Controlling the Flow of Data 167
Filtering data 167
Filtering rows upon conditions 167
Reading a file and getting the list of words found in it 168
Filtering unwanted rows with a Filter rows step 170
Filtering rows by using the Java Filter step 173
Filtering data based on row numbers 174
Splitting streams unconditionally 175
Copying rows 177
Distributing rows 181
Introducing partitioning and clustering 182
Splitting the stream based on conditions 184
Splitting a stream based on a simple condition 184
Exploring PDI steps for splitting a stream based on conditions 187
Merging streams in several ways 189
Merging two or more streams 189
Customizing the way of merging streams 193
Looking up data 194
Looking up data with a Stream lookup step 195
Summary 200
Chapter 7: Cleansing, Validating, and Fixing Data 201
Cleansing data 201
Cleansing data by example 202
Standardizing information 202
Improving the quality of data 204
Introducing PDI steps useful for cleansing data 206
Dealing with non-exact matches 207
Cleansing by doing a fuzzy search 207
Deduplicating non-exact matches 213
Validating data 215
Validating data with PDI 216
Validating and reporting errors to the log 216
Introducing common validations and their implementation with PDI 219
Treating invalid data by splitting and merging streams 220
Fixing data that doesn't match the rules 221
Summary 223
Chapter 8: Manipulating Data by Coding 225
Doing simple tasks with the JavaScript step 225
Using the JavaScript language in PDI 226
Inserting JavaScript code using the JavaScript step 226
Adding fields 229
Modifying fields 230
Organizing your code 231
Controlling the flow using predefined constants 233
Testing the script using the Test script button 235
Parsing unstructured files with JavaScript 237
Doing simple tasks with the Java Class step 239
Using the Java language in PDI 239
Inserting Java code using the Java Class step 241
Learning to insert Java code in a Java Class step 242
Data types equivalence 244
Adding fields 244
Modifying fields 247
Controlling the flow with the putRow() function 247
Testing the Java Class using the Test class button 249
Getting the most out of the Java Class step 251
Receiving parameters 251
Reading data from additional steps 252
Redirecting data to different target steps 253
Parsing JSON structures 254
Avoiding coding using purpose-built steps 255
Summary 257
Chapter 9: Transforming the Dataset 258
Sorting data 258
Sorting a dataset with the Sort rows step 259
Working on groups of rows 262
Aggregating data 262
Summarizing the PDI steps that operate on sets of rows 268
Converting rows to columns 270
Converting row data to column data using the Row denormaliser step 270
Aggregating data with a Row denormaliser step 277
Normalizing data 278
Modifying the dataset with a Row Normaliser step 279
Going forward and backward across rows 282
Picking rows backward and forward with the Analytic Query step 283
Summary 289
Chapter 10: Performing Basic Operations with Databases 290
Connecting to a database and exploring its content 290
Connecting with Relational Database Management Systems 291
Exploring a database with the Database Explorer 295
Previewing and getting data from a database 298
Getting data from the database with the Table input step 298
Using the Table input step to run flexible queries 300
Adding parameters to your queries 301
Using Kettle variables in your queries 303
Inserting, updating, and deleting data 305
Inserting new data into a database table 305
Inserting or updating data with the Insert / Update step 307
Deleting records of a database table with the Delete step 309
Performing CRUD operations with more flexibility 310
Verifying a connection, running DDL scripts, and doing other useful tasks 311
Looking up data in different ways 312
Doing simple lookups with the Database Value Lookup step 312
Making a performance difference when looking up data in a database 316
Performing complex database lookups 316
Looking for data using a Database join step 317
Looking for data using a Dynamic SQL row step 320
Summary 321
Chapter 11: Loading Data Marts with PDI 323
Preparing the environment 323
Exploring the Jigsaw database model 324
Creating the database and configuring the environment 325
Introducing dimensional modeling 326
Loading dimensions with data 327
Learning the basics of dimensions 327
Understanding dimensions technical details 328
Loading a time dimension 329
Introducing and loading Type I slowly changing dimensions 332
Loading a Type I SCD with a combination lookup/update step 332
Introducing and loading Type II slowly changing dimensions 336
Loading Type II SCDs with a dimension lookup/update step 336
Loading a Type II SCD for the first time 338
Loading a Type II SCD and verifying how history is kept 341
Explaining and loading Type III SCDs and Hybrid SCDs 343
Loading other kinds of dimensions 344
Loading a mini dimension 344
Loading junk dimensions 345
Explaining degenerate dimensions 345
Loading fact tables 346
Learning the basics about fact tables 346
Deciding the level of granularity 346
Translating the business keys into surrogate keys 347
Obtaining the surrogate key for a Type I SCD 348
Obtaining the surrogate key for a Type II SCD 349
Obtaining the surrogate key for the junk dimension 350
Obtaining the surrogate key for a time dimension 351
Loading a cumulative fact table 352
Loading a snapshot fact table 353
Loading a fact table by inserting snapshot data 354
Loading a fact table by overwriting snapshot data 354
Summary 355
Chapter 12: Creating Portable and Reusable Transformations 356
Defining and using Kettle variables 357
Introducing all kinds of Kettle variables 357
Explaining predefined variables 357
Revisiting the kettle.properties file 358
Defining variables at runtime 358
Setting a variable with a constant value 358
Setting a variable with a value unknown beforehand 360
Setting variables with partial or total results of your flow 363
Defining and using named parameters 364
Using variables as fields of your stream 366
Creating reusable Transformations 368
Creating and executing sub-transformations 368
Creating and testing a sub-transformation 369
Executing a sub-transformation 372
Introducing more elaborate sub-transformations 375
Making the data flow between transformations 376
Transferring data using the copy/get rows mechanism 376
Executing transformations in an iterative way 379
Using Transformation executors 379
Configuring the executors with advanced settings 382
Getting the results of the execution of the inner transformation 382
Working with groups of data 384
Using variables and named parameters 385
Continuing the flow after executing the inner transformation 385
Summary 387
Chapter 13: Implementing Metadata Injection 388
Introducing metadata injection 388
Explaining how metadata injection works 389
Creating a template Transformation 391
Injecting metadata 392
Discovering metadata and injecting it 395
Identifying use cases to implement metadata injection 400
Summary 401
Chapter 14: Creating Advanced Jobs 402
Enhancing your processes with the use of variables 402
Running nested jobs 402
Understanding the scope of variables 403
Using named parameters 404
Using variables to create flexible processes 407
Using variables to name jobs and transformations 407
Using variables to name Job and Transformation folders 409
Accessing copied rows for different purposes 410
Using the copied rows to manage files in advanced ways 411
Using the copied rows as parameters of a Job or Transformation 415
Working with filelists 417
Maintaining a filelist 417
Using the filelist for different purposes 418
Attaching files in an email 418
Copying, moving, and deleting files 420
Introducing other ways to process the filelist 421
Executing jobs in an iterative way 422
Using Job executors 422
Configuring the executors with advanced settings 427
Getting the results of the execution of the job 427
Working with groups of data 429
Using variables and named parameters 430
Capturing the result filenames 430
Summary 432
Chapter 15: Launching Transformations and Jobs from the Command Line 433
Using the Pan and Kitchen utilities 433
Running jobs and transformations 434
Checking the exit code 435
Supplying named parameters and variables 436
Using command-line arguments 438
Deciding between the use of a command-line argument and named parameters 442
Sending the output of executions to log files 443
Automating the execution 446
Summary 447
Chapter 16: Best Practices for Designing and Deploying a PDI Project 448
Setting up a new project 448
Setting up the local environment 449
Defining a folder structure for the project 450
Dealing with external resources 451
Defining and adopting a versioning system 452
Best practices to design jobs and transformations 453
Styling your work 453
Making the work portable 455
Designing and developing reusable jobs and transformations 456
Maximizing the performance 456
Analyzing Step Metrics 457
Analyzing performance graphs 458
Deploying the project in different environments 459
Modifying the Kettle home directory 460
Modifying the Kettle home in Windows 461
Modifying the Kettle home in Unix-like systems 462
Summary 462
Index 463