Cloudera Sessions - Clinic 3 - Advanced Steps - Fast-track Development for ETL and Data Integration
Transcript of Cloudera Sessions - Clinic 3 - Advanced Steps - Fast-track Development for ETL and Data Integration
Fast-track Development for Big Data Integration
Why do you need ETL?

The ETL lifecycle: Define & Document → Prototype & Design → Design & Develop → Test → Manage (Monitor, Troubleshoot, Secure & Retain) → Iterate.

The data flow: extract from sources (database, XML, flat files, applications), understand, cleanse, integrate & transform, and load to analytic targets and reports.

Deployment options: ETL, ETL on Grid, ELT, Hadoop, Cloud, Real time, Replication.
Let’s suppose…
• ABC Bank is rolling out a new service to provide daily stock recommendations based on customers’ prior transaction history, propensity for risk, and stock popularity
• Input data:
  • Market Data – Bloomberg daily stock price and volume for one year
  • Customer Transactions (i.e., trades) – stock purchases over the last 5 years
  • Twitter – daily number of tweets for each stock symbol for one year
  • Web Logs – daily number of stock views for each customer for one year
• Output:
  • Customer Stock Recommendations – daily stock recommendations for each customer
If you did this on your own, what would you need to build? What skills are needed?

SQL:
select id, from (select t2.id, t2.item_attr1, t2.total_sales, sum(item_purchase.bought_at) as total_purchase from (select t1.id, t1.item_attr1, sum(item_sales.sold_at) as total_sales from (select id, item_attr1 from item where ... ) as t1 left join item_sales on item.id =

JSON:
<apex:page controller="RestDemoJsonController" tabStyle="Contact"> <apex:sectionHeader title="Google Maps Geocoding" subtitle="REST Demo (JSON)"/> <apex:form > <apex:pageBlock > <apex:pageBlockButtons > <apex:commandButton action="{!submit}" value="Submit" rerender="resultsPanel" status="status"/> </apex:pageBlockButtons> <apex:pageMessages />

JAVA:
// String id, String name public static void GetLatLon_json_Request () { Http http = new Http(); HttpRequest req = new HttpRequest(); req.setEndpoint( 'http://maps.google.com/maps/api/geocode/xml?address=torre+parquecristal+caracas+venezuela&sensor=true'); req.setMethod('GET'); HTTPResponse resp = http.send(req); String json; json = resp.getBody().replace('\n', '');

PERL:
open(DBFMT, $opt_database . ".fmt") || die "can't open format file:$opt_database",".fmt\n"; $docBegin="DOC"; $docEnd="\/DOC"; $idBegin="DOCNO"; $idEnd="\/DOCNO"; while (<DBFMT>) { print STDERR if $debug; if (/^\s*TITLE\s*:\s*([^\s]+)/) { $title{$1}=1;
What if something changes?
Doing this on your own has challenges
• Time-consuming
• Requires specialized skills
• Hard to maintain, difficult to change
• No reuse
There are alternative approaches…
Let’s see how this works with an Informatica Demo
Challenges with traditional infrastructure
• Cannot cost-effectively scale as data volumes grow
• Not designed to support many new data types
• Does not support rapid agile development
• Analysis is not flexible enough to support rapid discovery
Maximize your return on big data

Data sources: Transactions, OLTP, OLAP; Social Media, Web Logs; Documents, Machine Device, Scientific.
Operational systems (OLTP, ODS) and analytical systems (Data Warehouse, Data Marts, MDM) feed reports & analytics.
Pipeline: Access & Ingest → Parse & Prepare → Discover & Profile → Transform & Cleanse → Extract & Deliver.
Manage (i.e., security, performance, governance, collaboration) spans the entire pipeline.
If you did this on your own, what would you need to build? What skills are needed?

SQL:
select id, from (select t2.id, t2.item_attr1, t2.total_sales, sum(item_purchase.bought_at) as total_purchase from (select t1.id, t1.item_attr1, sum(item_sales.sold_at) as total_sales from (select id, item_attr1 from item where ... ) as t1 left join item_sales on item.id =

JSON:
<apex:page controller="RestDemoJsonController" tabStyle="Contact"> <apex:sectionHeader title="Google Maps Geocoding" subtitle="REST Demo (JSON)"/> <apex:form > <apex:pageBlock > <apex:pageBlockButtons > <apex:commandButton action="{!submit}" value="Submit" rerender="resultsPanel" status="status"/> </apex:pageBlockButtons> <apex:pageMessages />

JAVA:
// String id, String name public static void GetLatLon_json_Request () { Http http = new Http(); HttpRequest req = new HttpRequest(); req.setEndpoint( 'http://maps.google.com/maps/api/geocode/xml?address=torre+parquecristal+caracas+venezuela&sensor=true'); req.setMethod('GET'); HTTPResponse resp = http.send(req); String json; json = resp.getBody().replace('\n', '');

PERL:
open(DBFMT, $opt_database . ".fmt") || die "can't open format file:$opt_database",".fmt\n"; $docBegin="DOC"; $docEnd="\/DOC"; $idBegin="DOCNO"; $idEnd="\/DOCNO"; while (<DBFMT>) { print STDERR if $debug; if (/^\s*TITLE\s*:\s*([^\s]+)/) { $title{$1}=1;

HADOOP PIG:
pv_by_industry = GROUP profile_view BY viewee_industry_id; pv_avg_by_industry = FOREACH pv_by_industry GENERATE group AS viewee_industry_id, AVG(profile_view) AS average_pv;

HIVE:
INSERT OVERWRITE TABLE dog_food SELECT pv.*, u.brand, u.age, f.SKU FROM page_view pv JOIN user u ON (pv.id = u.id) JOIN breed_list f ON (u.id = f.uid) WHERE pv.date = '2013-02-26';

MAPREDUCE:
public static void main(String[] args) throws Exception { Job job = Job.getInstance(new Configuration(), "example"); job.setMapperClass(WordMapper.class); job.setInputFormatClass(KeyValueTextInputFormat.class); FileInputFormat.addInputPath(job, new Path("/tmp/hadoop-cscarioni/dfs/name/file")); FileOutputFormat.setOutputPath(job, new Path("output")); System.exit(job.waitForCompletion(true) ? 0 : 1); }
Implement a proven path to innovation
Innovate Faster With Big Data (onboard, discover, operationalize)
Minimize Risk of New Technologies (design once, deploy anywhere)
Lower Big Data Project Costs (helps self-fund big data projects)
Informatica + Cloudera: Lower Costs
Data sources: Transactions, OLTP, OLAP; Social Media, Web Logs; Machine Device, Scientific; Documents and Emails. Targets: EDW and Data Marts.
• Optimize processing with low-cost commodity hardware (vs. a traditional grid)
• Increase productivity up to 5X
Pipeline: Access & Ingest → Parse & Prepare → Discover & Profile → Transform & Cleanse → Extract & Deliver.
Informatica + Cloudera: Minimize Risk
Quickly staff projects with trained data integration experts
Informatica + Cloudera: Minimize Risk
Design once and deploy anywhere:
• Traditional grid
• On-premise or in the Cloud
• Pushdown to RDBMS or DW appliance
Informatica + Cloudera: Innovate Faster
Onboard any data (Transactions, OLTP, OLAP; Social Media, Web Logs; Machine Device, Scientific; Documents and Emails) and deliver it to analytics & operational dashboards, mobile apps, and real-time alerts.
• Onboard and analyze any type of data to gain big data insights
• Discover insights faster through rapid development and collaboration
• Operationalize big data insights to generate new revenue streams
How does Informatica + Cloudera do this?
Maximize your return on big data

Data sources: Transactions, OLTP, OLAP; Social Media, Web Logs; Documents, Machine Device, Scientific.
Operational systems (OLTP, ODS) and analytical systems (Data Warehouse, Data Marts, MDM) feed reports & analytics.
Pipeline: Access & Ingest → Parse & Prepare → Discover & Profile → Transform & Cleanse → Extract & Deliver.
Manage (i.e., security, performance, governance, collaboration) spans the entire pipeline.
Data Ingestion and Extraction
Sources (Transactions, OLTP, OLAP; Social Media, Web Logs; Machine Device, Scientific; Documents and Emails) are delivered to the Data Warehouse, applications, and Data Marts via four modes: Batch, Replication, Streaming, and Archiving.
Integrate All Data: High-Performance Data Access
• Messaging & Web Services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, Web Services, TIBCO, webMethods
• Packaged Applications: JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft, SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel
• Relational & Flat Files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC, JDBC, flat files, ASCII reports, HTML, RPG, ANSI, LDAP
• Mainframe & Midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM, binary flat files, tape formats…
• Unstructured Data & Files: Word, Excel, PDF, StarOffice, WordPerfect, Email (POP, IMAP), HTTP
• Industry Standards: FIX, SWIFT, EDI–X12, EDI-Fact, HL7, HIPAA, NACHA, AST, Cargo IMP, MVR
• XML Standards: ebXML, HL7 v3.0, ACORD (AL3, XML), XML, LegalXML, IFX, cXML, RosettaNet
• SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand
• Social Media: Facebook, Twitter, LinkedIn, Kapow, Datasift
• MPP Appliances: Teradata, AsterData, EMC/Greenplum, Vertica
Informatica ETL Execution on Hadoop
Example of generated Hive HQL:
SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY, customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY FROM ( SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx FROM lineitem GROUP BY L_ORDERKEY ) T1 JOIN orders ON (customer.C_ORDERKEY = orders.O_ORDERKEY)
1. The mapping is translated and optimized to Hive HQL and user-defined functions (UDFs)
2. The optimized HQL is translated to MapReduce
3. MapReduce and the UDFs (running the Informatica data transformation engine) are executed on the Cloudera data nodes
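To make the idea of "mapping translated to HQL" concrete, here is a minimal, purely illustrative sketch: a toy mapping specification compiled to a Hive HQL string. The spec format and the generate_hql() function are hypothetical teaching devices, not Informatica's actual translation mechanism or API.

```python
# Hypothetical sketch only: compile a tiny declarative "mapping" spec into
# a Hive HQL string, loosely mimicking how a visual mapping becomes SQL.

def generate_hql(mapping):
    """Render a simple source -> filter -> aggregate mapping as Hive HQL."""
    cols = ", ".join(mapping["columns"])
    hql = f"SELECT {cols} FROM {mapping['source']}"
    if mapping.get("filter"):
        hql += f" WHERE {mapping['filter']}"          # row filter transform
    if mapping.get("group_by"):
        hql += " GROUP BY " + ", ".join(mapping["group_by"])  # aggregator
    return hql

# Example mapping: count page views per id for one day (names are made up).
mapping = {
    "source": "page_view",
    "columns": ["id", "count(*) AS views"],
    "filter": "dt = '2013-02-26'",
    "group_by": ["id"],
}
print(generate_hql(mapping))
# -> SELECT id, count(*) AS views FROM page_view WHERE dt = '2013-02-26' GROUP BY id
```

In the real product the generated HQL (plus UDFs) is then handed to Hive, which compiles it to MapReduce jobs; the sketch stops at the string-generation step.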
Data Profiling & Discovery on Hadoop
• Value and pattern frequency to isolate data quality issues
• Discover data domains & relationships, including PII data
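Value and pattern frequency can be sketched in a few lines. The following is an illustrative toy, not Informatica's profiler: it maps each value to a "pattern" (letters become A, digits become 9, a common profiling convention) and counts both raw values and patterns, so outlier patterns surface likely data quality issues.

```python
# Illustrative sketch of value/pattern frequency profiling (hypothetical,
# not the product's implementation).
from collections import Counter

def pattern(value: str) -> str:
    """Abstract a value: digits -> '9', letters -> 'A', keep punctuation."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c
                   for c in value)

def profile(column):
    """Return (value frequencies, pattern frequencies) for a column."""
    values = Counter(column)
    patterns = Counter(pattern(v) for v in column)
    return values, patterns

phones = ["555-1212", "555-3434", "55X-1212", "555-9876"]
values, patterns = profile(phones)
# The rare pattern '99A-9999' (one row) flags a likely bad phone number.
print(patterns.most_common())
```

Running the same idea as a MapReduce or Hive job is what moves this kind of profiling onto the Hadoop cluster itself.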
Informatica + Cloudera Demo
Informatica + Cloudera Demo Scenario
• ABC Bank is rolling out a new service to provide daily stock recommendations based on customers’ prior transaction history, propensity for risk, and stock popularity
• Input data:
  • Market Data – Bloomberg daily stock price and volume for 2012
  • Customer Transactions (i.e., trades) – stock purchases over the last 5 years
  • Twitter – daily number of tweets for each stock symbol for 2012
  • Web Logs – daily number of stock views for each customer for 2012
• Output:
  • Customer Stock Recommendations – daily stock recommendations for each customer, available in a relational data warehouse
• Data Integration on Hadoop
• Data Quality and Profiling on Hadoop
• Data Parsing on Hadoop
• NLP & Entity Extraction on Hadoop
• Replication to Hadoop
• Archiving on Hadoop
Connect to HDFS and Hive
Sources (Transactions, OLTP, OLAP; Documents; Social Media, Web Logs; Machine Device, Scientific) flow into the cluster: a NameNode/Job Tracker plus DataNodes 1–3, each running HDFS and MapReduce. INFA clients and Informatica services (backed by a metadata repository) connect to HDFS and to Hive, with data access to the RDBMS.
Next Steps
Transform • Parse • Cleanse • Profile • Match • Archive
1. LOWER COSTS: optimized end-to-end data management performance on Hadoop; rich pre-built library of ETL transforms, data quality rules, complex file parsing, and data profiling on Hadoop
2. INCREASE PRODUCTIVITY: up to 5X productivity gains with no-code visual development and management
3. ACCELERATE ADOPTION: 500+ partners and 100,000+ trained Informatica developers; 360+ partners and 15,000+ trained on Cloudera annually on 6 continents
Apply Data Governance
Lifecycle: Discover → Define → Apply → Measure and Monitor.
Capabilities: Transform, Parse, Cleanse, Profile, Match, Archive.
What is the plan forward?
• Tomorrow:
  • Identify a business opportunity where data can have a significant impact
  • Identify the skills you need to build a team with big data competencies
• 3 months:
  • Identify and prioritize the data you need to improve the business (both internal and external)
  • Determine what data to store in Cloudera to lower and control cost
  • Put a business plan together to optimize your DW/BI infrastructure
  • Execute a quick-win big data project with demonstrable ROI
• 1 year:
  • Extend data governance to include more data and more types of data that impact the business
  • Consider a shared-services model to promote best practices and further lower infrastructure and labor costs
Thank You! cloudera.com/clouderasessions