Pentaho Data Integration: Extracting, Integrating, Normalizing and Preparing My Data


WORKSHOP

Pentaho Data Integration: Extracting, Integrating, Normalizing and Preparing My Data

Big Data and Business Intelligence Program Projects

Alex Rayón (alex.rayon@deusto.es)

November 2015

Before starting…

Who has used a relational database?

Source: http://www.agiledata.org/essays/databaseTesting.html

2

Before starting… (II)

Who has written scripts or Java code to move data from one source and load it to another?

Source: http://www.theguardian.com/teacher-network/2012/jan/10/how-to-teach-code

3

Before starting… (III)

What did you use?

1. Scripts

2. Custom Java Code

3. ETL

4

Table of Contents

Pentaho at a glance

In the academic field

ETL

Kettle

Big Data

Predictive Analytics

5

Table of Contents

Pentaho at a glance

In the academic field

ETL

Kettle

Big Data

Predictive Analytics

6

Pentaho at a glance

Business Intelligence

7

Pentaho at a glance (II)

8

Pentaho at a glance (III): Business Intelligence & Analytics

Open Core

GPL v2

Apache 2.0

Enterprise and OEM licenses

Java-based

Web front-ends

9

Pentaho at a glance (IV): The Pentaho Stack

Data Integration / ETL

Big Data / NoSQL

Data Modeling

Reporting

OLAP / Analysis

Data Visualization

Dashboarding

Data Mining / Predictive Analysis

Scheduling

Source: http://helicaltech.com/blogs/hire-pentaho-consultants-hire-pentaho-developers/

10

Pentaho at a glance (V): Modules

Pentaho Data Integration

Kettle

Pentaho Analysis

Mondrian

Pentaho Reporting

Pentaho Dashboards

Pentaho Data Mining

WEKA

11

Pentaho at a glance (VI)

Figures

More than 10,000 deployments

More than 185 countries

More than 1,200 customers

In the Gartner Magic Quadrant for BI Platforms since 2012

1 download every 30 seconds

12

Pentaho at a glance (VII)

Open Source Leader

13

Pentaho at a glance (VIII): Single Platform

14

Table of Contents

Pentaho at a glance

In the academic field

ETL

Kettle

Big Data

Predictive Analytics

15

Academic field

16

Academic field (II)

17

Academic field (III)

18

Academic field (IV)

19

Academic field (V)

20

Academic field (VI)

21

Table of Contents

Pentaho at a glance

In the academic field

ETL

Kettle

Big Data

Predictive Analytics

22

ETL: Definition and characteristics

An ETL tool is a tool that:

Extracts data from various data sources (usually legacy data)

Transforms data

from → being optimized for transactions

to → being optimized for reporting and analysis

Synchronizes the data coming from different databases

Cleanses the data to remove errors

Loads the data into a data warehouse

23
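The extract-transform-load cycle described above can be sketched in a few lines. This is a hypothetical Python illustration, not Kettle code; the CSV source, field names and in-memory "warehouse" are all invented for the example:

```python
import csv, io

# Hypothetical transactional source, as CSV text
raw = "id,name,amount\n1, Alice ,10.5\n2,Bob,\n3, Carol ,7.25\n"

def extract(text):
    """Extract: read records from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim names, drop records with missing amounts (cleansing)."""
    clean = []
    for r in rows:
        if not r["amount"]:
            continue  # data cleansing: remove incomplete records
        clean.append({"id": int(r["id"]),
                      "name": r["name"].strip(),
                      "amount": float(r["amount"])})
    return clean

def load(rows, warehouse):
    """Load: append the cleansed records into the 'warehouse' (a list here)."""
    warehouse.extend(rows)

warehouse = []
load(transform(extract(raw)), warehouse)
# warehouse now holds only the complete, normalized records
```

In a real deployment the load step would write to a data warehouse table rather than a Python list, but the three-phase structure is the same.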

ETL: Why do I need it?

ETL tools save time and money when developing a data warehouse by removing the need for hand-coding

It is very difficult for database administrators to connect between different brands of databases without using an external tool

In the event that databases are altered or new databases need to be integrated, a lot of hand-coded work needs to be completely redone

24

ETL: Business Intelligence

ETL is the heart and soul of business intelligence (BI)

ETL processes bring together and combine data from multiple source systems into a data warehouse

Source: http://datawarehouseujap.blogspot.com.es/2010/08/data-warehouse.html

25

ETL: Business Intelligence (II)

According to most practitioners, ETL design and development work consumes 60 to 80 percent of an entire BI project

Source: http://www.dwuser.com/news/tag/optimization/

Source: The Data Warehousing Institute. www.dw-institute.com

26

ETL: Processing framework

Source: The Data Warehousing Institute. www.dw-institute.com

27

ETL: Tools

Source: http://www.slideshare.net/jade_22/kettleetltool-090522005630phpapp01

28

ETL: Open Source tools

CloverETL

KETL

Kettle

Talend

29

ETL: CloverETL

Provides a basic library of functions for mapping and transformations, allowing companies to move large amounts of data as quickly and efficiently as possible

Uses building blocks called components to create a transformation graph, which is a visual depiction of the intended data processing

30

ETL: CloverETL (II)

The graphic presentation simplifies even complex data transformations, allowing for drag-and-drop functionality

Limited to approximately 40 different components to simplify graph creation

Yet you may configure each component to meet specific needs

It also features extensive debugging capabilities to ensure all transformation graphs work precisely as intended

31
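The "components wired into a transformation graph" idea can be illustrated with a toy sketch. This is generic Python, not CloverETL's actual API; the component names and sample records are invented:

```python
# Toy transformation graph: each "component" consumes a stream of records
# and yields transformed records; chaining components forms the data flow.

def reader(records):                 # source component
    yield from records

def filter_component(pred):          # configurable filtering component
    def run(stream):
        return (r for r in stream if pred(r))
    return run

def map_component(fn):               # configurable mapping component
    def run(stream):
        return (fn(r) for r in stream)
    return run

def build_graph(source, *components):
    stream = reader(source)
    for comp in components:          # wire components into a linear graph
        stream = comp(stream)
    return stream

data = [{"city": "Bilbao", "temp_c": 12},
        {"city": "Sevilla", "temp_c": 31}]

graph = build_graph(
    data,
    filter_component(lambda r: r["temp_c"] > 20),
    map_component(lambda r: {**r, "temp_f": r["temp_c"] * 9 / 5 + 32}),
)
result = list(graph)
```

A visual tool like CloverETL draws this same structure as boxes and arrows; configuring a component corresponds to the arguments passed to `filter_component` and `map_component` here.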

ETL: KETL

Contains a scalable, platform-independent engine capable of supporting multiple computers and 64-bit servers

The program also offers performance monitoring, extensive data source support, XML compatibility and a scheduling engine for time-based and event-driven job execution

32

ETL: Kettle

The Pentaho company produced Kettle as an open-source alternative to commercial ETL software

No relation to Kinetic Networks' KETL

Kettle features a drag-and-drop graphical environment with progress feedback for all data transactions, including automatic documentation of executed jobs

An XML input stream handles huge XML files without a loss in performance or a spike in memory usage

Users can also upgrade the free Kettle version for optional pay features and dedicated technical support.

33

ETL: Talend

Provides a graphical environment for data integration, migration and synchronization

Drag-and-drop graphic components generate the Java code required to execute the desired task, saving time and effort

Pre-built connectors to enable compatibility with a wide range of business systems and databases

Users gain real-time access to corporate data, allowing for the monitoring and debugging of transactions to ensure smooth data integration

34

ETL: Comparison

The criteria used for the ETL tool comparison fall into the following categories:

TCO

Risk

Ease of use

Support

Deployment

Speed

Data Quality

Monitoring

Connectivity

35

ETL: Comparison (II)

36

ETL: Comparison (III)

Total Cost of Ownership

The overall cost of a given product.

This can include the initial purchase, licensing, servicing, support, training, consulting, and any other payments that need to be made before the product is in full use

Commercial open-source products are typically free to use; what companies pay for is support, training and consulting

37

ETL: Comparison (IV)

Risk

There are always risks with projects, especially big projects.

The risks for projects failing are:

Going over budget

Going over schedule

Not completing the requirements or expectations of the customers

Open Source products carry much lower risk than commercial ones, since they do not restrict the use of their products with pricey licenses

38

ETL: Comparison (V)

Ease of use

All of the ETL tools, apart from Inaport, have a GUI to simplify the development process

Having a good GUI also reduces the time needed to train people and use the tools

Pentaho Kettle has the easiest-to-use GUI of all the tools

Training can also be found online or within the community

39

ETL: Comparison (VI)

Support

Nowadays, all software products have support and all of the ETL tool providers offer support

Pentaho Kettle – Offers support from US, UK and has a partner consultant in Hong Kong

Deployment

Pentaho Kettle is a stand-alone Java engine that can run on any machine that can run Java. It needs an external scheduler to run jobs automatically.

It can be deployed on many different machines that act as “slave servers” to help with transformation processing.

Recommended minimum: a 1 GHz CPU and 512 MB of RAM

40

ETL: Comparison (VII)

Speed

The speed of ETL tools depends largely on the data that needs to be transferred over the network and the processing power involved in transforming the data.

Pentaho Kettle is faster than Talend, but its Java connector slows it down somewhat. Like Talend, it requires manual tweaking. It can be clustered across many machines to reduce network traffic

41

ETL: Comparison (VIII)

Data Quality

Data Quality is fast becoming the most important feature in any data integration tool.

Pentaho has DQ features in its GUI and allows for customized SQL statements, JavaScript and regular expressions. Additional modules are available with a subscription.
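The regex side of such data-quality checks is easy to picture. This is an illustrative Python sketch (in Kettle the equivalent logic would live in a JavaScript or regex-evaluation step); the field names, patterns and sample rows are invented:

```python
import re

# Regex-based validation: flag records whose fields fail a pattern.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
PHONE_RE = re.compile(r"^\+?\d{9,15}$")

def validate(record):
    """Return the list of data-quality problems found in one record."""
    problems = []
    if not EMAIL_RE.match(record.get("email", "")):
        problems.append("bad email")
    if not PHONE_RE.match(record.get("phone", "")):
        problems.append("bad phone")
    return problems

rows = [
    {"email": "ana@example.com", "phone": "+34600111222"},
    {"email": "not-an-email", "phone": "123"},
]
# Data-quality report: index of each failing row and its problems
report = {i: validate(r) for i, r in enumerate(rows) if validate(r)}
```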

Monitoring

Pentaho Kettle – has practical monitoring tools and logging

42

ETL: Comparison (IX)

Connectivity

In most cases, ETL tools transfer data from legacy systems

Their connectivity is very important to the usefulness of the ETL tools.

Kettle can connect to a very wide variety of databases, flat files, XML files, Excel files and web services.

43

Table of Contents

Pentaho at a glance

In the academic field

ETL

Kettle

Big Data

Predictive Analytics

44

Kettle: Introduction

Project Kettle

Powerful Extraction, Transformation and Loading (ETL) capabilities using an innovative, metadata-driven approach

45

Kettle: Introduction (II)

What is Kettle?

Batch data integration and processing tool written in Java

Exists to retrieve, process and load data

PDI is a synonymous term

Source: http://www.dreamstime.com/stock-photo-very-old-kettle-isolated-image16622230

46

Kettle: Introduction (III)

It uses an innovative metadata-driven approach

It has a very easy-to-use GUI

Strong community of 13,500 registered users

It uses a stand-alone Java engine that processes the tasks for moving data between many different databases and files

47

Kettle: Introduction (IV)

48

Kettle: Data Integration Platform

Source: http://download.101com.com/tdwi/research_report/2003ETLReport.pdf

49

Kettle: Architecture

Source: Pentaho Corporation

50

Kettle: Most common uses

Data warehouse and data mart loads

Data Integration

Data cleansing

Data migration

Data export

etc.

51

Kettle: Data Integration

Changing input to desired output

Jobs

Synchronous workflow of job entries (tasks)

Transformations

Stepwise, parallel and asynchronous processing of a record stream

Distributed

52
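The job/transformation distinction above can be made concrete with a small sketch. This is illustrative Python, not Kettle internals; the queue-per-hop design and step functions are assumptions for the example:

```python
import queue, threading

def run_job(entries):
    """Job: synchronous workflow - each entry completes before the next starts."""
    return [entry() for entry in entries]

def run_transformation(source_rows, steps):
    """Transformation: every step runs at once, rows stream through 'hops'."""
    DONE = object()                              # end-of-stream sentinel
    hops = [queue.Queue() for _ in range(len(steps) + 1)]

    def worker(step, inbox, outbox):
        while (row := inbox.get()) is not DONE:  # process rows as they arrive
            outbox.put(step(row))
        outbox.put(DONE)                         # propagate end-of-stream

    threads = [threading.Thread(target=worker, args=(s, hops[i], hops[i + 1]))
               for i, s in enumerate(steps)]
    for t in threads:
        t.start()
    for row in source_rows:                      # feed the first hop
        hops[0].put(row)
    hops[0].put(DONE)
    out = []
    while (row := hops[-1].get()) is not DONE:   # drain the last hop
        out.append(row)
    for t in threads:
        t.join()
    return out

rows = run_transformation([1, 2, 3], [lambda x: x * 10, lambda x: x + 1])
```

The key point the sketch shows: in a transformation, the second step can start working on row 1 while the first step is still processing row 2, which is exactly the asynchronous record-stream behavior the slide describes.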

Kettle: Data integration challenges

Data is everywhere

Data is inconsistent

Records are different in each system

Performance issues

Running queries that summarize data over a long period ties up the operational system for the task, pushing the OS to maximum load

Data is never all in Data Warehouse

Excel sheets, acquisitions, new applications

53

Kettle: Transformations

String and Date Manipulation

Data Validation / Business Rules

Lookup / Join

Calculation, Statistics

Cryptography

Decisions, Flow control

Scripting

etc.

54
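A few of the step types listed above can be sketched as plain functions. This is an illustrative Python sketch, not Kettle's step implementations; the lookup table and sample record are invented:

```python
from datetime import datetime

def string_step(row):                     # string manipulation
    return {**row, "name": row["name"].strip().title()}

def date_step(row):                       # date manipulation: dd/mm/yyyy -> ISO
    d = datetime.strptime(row["date"], "%d/%m/%Y")
    return {**row, "date": d.strftime("%Y-%m-%d")}

COUNTRIES = {"ES": "Spain", "FR": "France"}

def lookup_step(row):                     # lookup / join against a reference table
    return {**row, "country": COUNTRIES.get(row["country"], "Unknown")}

def calc_step(row):                       # calculation
    return {**row, "total": round(row["qty"] * row["price"], 2)}

row = {"name": "  alice smith ", "date": "15/11/2015",
       "country": "ES", "qty": 3, "price": 9.99}
for step in (string_step, date_step, lookup_step, calc_step):
    row = step(row)
```

In Kettle each of these would be a box on the canvas configured through the GUI rather than code.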

Kettle: What is it good for?

Mirroring data from master to slave

Syncing two data sources

Processing data retrieved from multiple sources and pushed to multiple destinations

Loading data to RDBMS

Datamart / Datawarehouse

Dimension lookup/update step

Graphical manipulation of data

55

Kettle: Alternatives

56

Code

Custom Java

Spring batch

Scripts

Perl, Python, shell, etc.

Possibly combined with a DB loader tool and cron

Commercial ETL tools

Datastage

Informatica

Oracle Warehouse Builder

SQL Server Integration services

Kettle: Extraction

57

Kettle: Extraction (II)

Source: http://download.101com.com/tdwi/research_report/2003ETLReport.pdf

58

Kettle: Extraction (III)

RDBMS (SQL Server, DB2, Oracle, MySQL, PostgreSQL, Sybase IQ, etc.)

NoSQL Data: HBase, Cassandra, MongoDB

OLAP (Mondrian, Palo, XML/A)

Web (REST, SOAP, XML, JSON)

Files (CSV, Fixed, Excel, etc.)

ERP (SAP, Salesforce, OpenERP)

Hadoop Data: HDFS, Hive

Web Data: Twitter, Facebook, Log Files, Web Logs

Others: LDAP/Active Directory, Google Analytics, etc.

59
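The point of supporting so many sources is that extraction normalizes them all into one record stream. A minimal illustration of that idea in Python (not Kettle's input steps; the sample CSV and JSON sources are invented):

```python
import csv, io, json

def read_csv(text):
    """Extract from a CSV source into a list of dict records."""
    return list(csv.DictReader(io.StringIO(text)))

def read_json(text):
    """Extract from a JSON source into a list of dict records."""
    return json.loads(text)

csv_src = "id,city\n1,Bilbao\n"
json_src = '[{"id": "2", "city": "Donostia"}]'

# Heterogeneous sources, one uniform record stream
records = read_csv(csv_src) + read_json(json_src)
cities = [r["city"] for r in records]
```

Once everything is a uniform record, the same downstream transformation steps apply regardless of where the data came from.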

Kettle: Transportation

60

Kettle: Transformation

61

Kettle: Loading

62

Kettle: Environment

63

Kettle: Comparison of Data Integration tools

64

Table of Contents

Pentaho at a glance

In the academic field

ETL

Kettle

Big Data

Predictive Analytics

65

Big Data: Business Intelligence

Source: http://es.wikipedia.org/wiki/Weka_(aprendizaje_autom%C3%A1tico)

A brief (BI) history….

66

Big Data: WEKA

Project Weka

A comprehensive set of tools for Machine Learning and Data Mining

Source: http://es.wikipedia.org/wiki/Weka_(aprendizaje_autom%C3%A1tico)

67

Big Data: Among Pentaho’s products

Mondrian

OLAP server written in Java

Kettle

ETL tool

Weka

Machine learning and Data Mining tool

68

Big Data: The WEKA platform

WEKA (Waikato Environment for Knowledge Analysis)

Funded by the New Zealand government (for more than 10 years)

Develop an open-source state-of-the-art workbench of data mining tools

Explore fielded applications

Develop new fundamental methods

Became part of Pentaho platform in 2006 (PDM - Pentaho Data Mining)

69

Big Data: Data Mining with WEKA

(One-of-the-many) Definition: Extraction of implicit, previously unknown, and potentially useful information from data

Goal: improve marketing, sales, and customer support operations, risk assessment etc.

Who is likely to remain a loyal customer?

What products should be marketed to which prospects?

What determines whether a person will respond to a certain offer?

How can I detect potential fraud?

70

Big Data: Data Mining with WEKA (II)

Central idea: historical data contains information that will be useful in the future (patterns → generalizations)

Data Mining employs a set of algorithms that automatically detect patterns and regularities in data

71
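The "patterns in historical data generalize to new cases" idea can be shown with a toy one-rule learner. WEKA's algorithms are far richer; this is an illustrative Python sketch with invented data, not WEKA code:

```python
from collections import Counter, defaultdict

# Historical data: (attribute value, observed outcome) pairs
history = [
    ("high", "default"), ("high", "default"), ("high", "pays"),
    ("low",  "pays"),    ("low",  "pays"),    ("low",  "default"),
]

def learn(examples):
    """Memorize the majority outcome per attribute value (the 'pattern')."""
    by_value = defaultdict(Counter)
    for value, outcome in examples:
        by_value[value][outcome] += 1
    # generalization: majority outcome per observed attribute value
    return {v: c.most_common(1)[0][0] for v, c in by_value.items()}

rule = learn(history)
# Apply the learned pattern to a new, unseen case
prediction = rule["high"]
```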

Big Data: Data Mining with WEKA (III)

A bank’s case as an example

Problem: Prediction (Probability Score) of a Corporate Customer Delinquency (or default) in the next year

Customer historical data used include:

Customer footings behavior (assets & liabilities)

Customer delinquencies (rates and time data)

Business Sector behavioral data

72

Big Data: Data Mining with WEKA (IV)

Variable selection using the Information Value (IV) criterion

Automatic binning of continuous variables was used (Chi-merge). Manual corrections were made to address particularities in the data distribution of some variables (again using IV)

73
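The Information Value criterion mentioned above has a standard closed form: for a binned variable, IV = Σ over bins of (%good − %bad) × ln(%good / %bad). A small Python sketch with invented good/bad counts (not the bank's actual data):

```python
import math

def information_value(bins):
    """bins: list of (n_good, n_bad) counts per bin of one variable."""
    total_good = sum(g for g, b in bins)
    total_bad = sum(b for g, b in bins)
    iv = 0.0
    for g, b in bins:
        pg, pb = g / total_good, b / total_bad
        iv += (pg - pb) * math.log(pg / pb)   # WoE-weighted difference
    return iv

# Hypothetical counts of non-delinquent (good) vs delinquent (bad) customers
bins = [(80, 10), (60, 30), (20, 60)]
iv = information_value(bins)
```

Higher IV means the binned variable separates good from bad customers more strongly, which is why it serves as a variable-selection criterion. (The sketch assumes no bin has zero counts; real implementations smooth those cases.)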

Big Data: Data Mining with WEKA (V)

74

Big Data: Data Mining with WEKA (VI)

75

Big Data: Data Mining with WEKA (VII)

Limitations

Traditional algorithms need to have all data in (main) memory, so big datasets are an issue

Solution

Incremental schemes

Stream algorithms

MOA (Massive Online Analysis)

http://moa.cs.waikato.ac.nz/

76
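The simplest instance of an incremental scheme is one-pass statistics: each record updates the model and is then discarded, so the full dataset never needs to sit in memory. A Python sketch of Welford's running mean/variance update (illustrative; MOA's stream algorithms are far more sophisticated):

```python
class RunningStats:
    """One-pass (Welford) mean and variance over a stream of numbers."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n           # incremental mean update
        self.m2 += delta * (x - self.mean)    # running sum of squared deviations

    @property
    def variance(self):
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 6.0, 8.0]:   # could just as well be an unbounded stream
    stats.update(x)
```

Memory use is constant regardless of how many records arrive, which is exactly the property stream algorithms need.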

Big Data: Be careful with Data Mining

77

Table of Contents

Pentaho at a glance

In the academic field

ETL

Kettle

Big Data

Predictive Analytics

78

Predictive analytics: Unified solution for Big Data Analytics

79

Predictive analytics: Unified solution for Big Data Analytics (II)

Current release: Pentaho Business Analytics Suite 4.8

Instant and interactive data discovery for iPad

● Full analytical power on the go – unique to Pentaho

● Mobile-optimized user interface

80

Predictive analytics: Unified solution for Big Data Analytics (III)

Current release: Pentaho Business Analytics Suite 4.8

Instant and interactive data discovery and development for big data

● Broadens big data access to data analysts

● Removes the need for separate big data visualization tools

● Further improves productivity for big data developers

81

Predictive analytics: Unified solution for Big Data Analytics (IV)

Pentaho Instaview

● Instaview is simple
○ Created for data analysts
○ Dramatically simplifies ways to access Hadoop and NoSQL data stores

● Instaview is instant & interactive
○ Time accelerator – 3 quick steps from data to analytics
○ Interact with big data sources – group, sort, aggregate & visualize

● Instaview is big data analytics
○ Marketing analysis for weblog data in Hadoop
○ Application log analysis for data in MongoDB

82

Predictive analytics: Comparison

Source: http://cdn.oreillystatic.com/en/assets/1/event/100/Using%20R%20and%20Hadoop%20for%20Statistical%20Computation%20at%20Scale%20Presentation.htm#/2

83

References

http://cdn.oreillystatic.com/en/assets/1/event/100/Big%20Data%20Architectural%20Patterns%20Presentation.pdf

http://blog.pentaho.com/tag/strata/

http://www.slideshare.net/mattcasters/pentaho-data-integration-introduction?from_search=2

http://www.slideshare.net/infoaxon/open-source-bi-7640848

http://download.101com.com/tdwi/research_report/2003ETLReport.pdf

http://www.slideshare.net/jade_22/kettleetltool-090522005630phpapp01

http://www.pentaho.com/Blend-of-the-Week?mkt_tok=3RkMMJWWfF9wsRonuKvNce%2FhmjTEU5z17%2BQoXaO2hokz2EFye%2BLIHETpodcMTcdgPbjYDBceEJhqyQJxPr3DJNAN1dt%2BRhDhCA%3D%3D#Analytics

84

Copyright (c) 2015 University of Deusto

This work (except the quoted images, whose rights are reserved to their owners*) is licensed under the Creative Commons “Attribution-ShareAlike” License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/
