Cloudera Sessions - Clinic 3 - Advanced Steps - Fast-track Development for ETL and Data Integration

description

Working with Hadoop does not always mean starting from scratch. In this session, you’ll learn how to leverage your existing investments in tools and skills to accelerate your Hadoop development. Learn from experts as they walk you step-by-step through the conversion of an existing ETL process to use Big Data.

Transcript of Cloudera Sessions - Clinic 3 - Advanced Steps - Fast-track Development for ETL and Data Integration

Page 1: Cloudera Sessions - Clinic 3 - Advanced Steps - Fast-track Development for ETL and Data Integration
Page 2

Fast-track Development for Big Data Integration

Page 3

Why do you need ETL?

Diagram: ETL lifecycle and data flow
Lifecycle: Understand; Define & Document; Prototype & Design; Design & Develop; Test; Manage; Monitor, Troubleshoot, Secure & Retain; Iterate
Sources: database, XML, flat files, app
Extract, Load, Cleanse, Integrate & Transform
Targets: Analytic, Report
ETL, ETL on Grid, ELT, Hadoop, Cloud, Real time, Replication

Page 4

Let’s suppose…

• ABC Bank is rolling out a new service to provide daily stock recommendations based on customers' prior transaction history, propensity for risk, and stock popularity

• Input data is
  • Market Data – Bloomberg daily stock price and volume for one year
  • Customer Transactions (i.e. trades) – Stock purchases over the last 5 years
  • Twitter – Daily number of tweets for each stock symbol for one year
  • Web Logs – Daily number of stock views for each customer for one year

• Output is
  • Customer Stock Recommendations – daily stock recommendations for each customer
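To make the scenario concrete, here is a minimal Python sketch of the kind of join such a pipeline has to perform. The record layouts, the function name `recommend`, and the scoring rule (tweets plus views, boosted by prior purchases) are all assumptions made for illustration; the slides do not define them, and market prices and risk propensity are left out to keep the sketch short.

```python
from collections import defaultdict


def recommend(trades, tweets, views, top_n=1):
    """Rank stock symbols per customer using a hypothetical scoring rule.

    trades: list of (customer_id, symbol) purchase records
    tweets: dict mapping symbol -> daily tweet count
    views:  dict mapping (customer_id, symbol) -> daily view count
    """
    # Count how often each customer bought each symbol.
    bought = defaultdict(int)
    for customer_id, symbol in trades:
        bought[(customer_id, symbol)] += 1

    # Every customer seen in either the trades or the web logs.
    customers = {c for c, _ in trades} | {c for c, _ in views}

    recommendations = {}
    for customer_id in sorted(customers):
        scores = {}
        for symbol, tweet_count in tweets.items():
            # Assumed rule: popularity (tweets + views) boosted by prior purchases.
            popularity = tweet_count + views.get((customer_id, symbol), 0)
            scores[symbol] = popularity * (1 + bought[(customer_id, symbol)])
        # Highest score first; ties are broken alphabetically by symbol.
        ranked = sorted(scores, key=lambda s: (-scores[s], s))
        recommendations[customer_id] = ranked[:top_n]
    return recommendations
```

Even this toy version has to reconcile three differently keyed inputs, which is the integration effort the following slides discuss.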

Page 5

If you did this on your own: What would you need to build? What skills are needed?

SQL

select id, from (
  select t2.id, t2.item_attr1, t2.total_sales, sum(item_purchase.bought_at) as total_purchase
  from (
    select t1.id, t1.item_attr1, sum(item_sales.sold_at) as total_sales
    from (select id, item_attr1 from item where ... ) as t1
    left join item_sales on item.id =

JSON

<apex:page controller="RestDemoJsonController" tabStyle="Contact">
  <apex:sectionHeader title="Google Maps Geocoding" subtitle="REST Demo (JSON)"/>
  <apex:form >
    <apex:pageBlock >
      <apex:pageBlockButtons >
        <apex:commandButton action="{!submit}" value="Submit" rerender="resultsPanel" status="status"/>
      </apex:pageBlockButtons>
      <apex:pageMessages />

JAVA

// String id, String name
public static void GetLatLon_json_Request () {
  Http http = new Http();
  HttpRequest req = new HttpRequest();
  req.setEndpoint('http://maps.google.com/maps/api/geocode/xml?address=torre+parquecristal+caracas+venezuela&sensor=true');
  req.setMethod('GET');
  HTTPResponse resp = http.send(req);
  String json;
  json = resp.getBody().replace('\n', '');

PERL

open(DBFMT, $opt_database . ".fmt") || die "can't open format file:$opt_database",".fmt\n";
$docBegin="DOC";
$docEnd="\/DOC";
$idBegin="DOCNO";
$idEnd="\/DOCNO";
while (<DBFMT>) {
  print STDERR if $debug;
  if (/^\s*TITLE\s*:\s*([^\s]+)/) {
    $title{$1}=1;

What if something changes?

Page 6

Doing this on your own has challenges

• Time-consuming

• Requires specialized skills

• Hard to maintain, difficult to change

• No reuse

Page 7

There are alternative approaches…

Let’s see how this works with an Informatica Demo

Page 8

Challenges with traditional infrastructure

• Cannot cost-effectively scale as data volumes grow

• Not designed to support many new data types

• Does not support rapid agile development

• Analysis is not flexible to facilitate rapid discovery

Page 9

Maximize your return on big data

Diagram: big data integration flow
Data Sources: Transactions, OLTP, OLAP; Social Media, Web Logs; Documents, Email; Machine Device, Scientific
Access & Ingest; Parse & Prepare; Discover & Profile; Transform & Cleanse; Extract & Deliver
Manage (i.e. Security, Performance, Governance, Collaboration)
Operational Systems, Analytical Systems, Reports & Analytics: OLTP, ODS, MDM, Data Warehouse, Data Mart

Page 10

If you did this on your own: What would you need to build? What skills are needed?

SQL

select id, from (
  select t2.id, t2.item_attr1, t2.total_sales, sum(item_purchase.bought_at) as total_purchase
  from (
    select t1.id, t1.item_attr1, sum(item_sales.sold_at) as total_sales
    from (select id, item_attr1 from item where ... ) as t1
    left join item_sales on item.id =

JSON

<apex:page controller="RestDemoJsonController" tabStyle="Contact">
  <apex:sectionHeader title="Google Maps Geocoding" subtitle="REST Demo (JSON)"/>
  <apex:form >
    <apex:pageBlock >
      <apex:pageBlockButtons >
        <apex:commandButton action="{!submit}" value="Submit" rerender="resultsPanel" status="status"/>
      </apex:pageBlockButtons>
      <apex:pageMessages />

JAVA

// String id, String name
public static void GetLatLon_json_Request () {
  Http http = new Http();
  HttpRequest req = new HttpRequest();
  req.setEndpoint('http://maps.google.com/maps/api/geocode/xml?address=torre+parquecristal+caracas+venezuela&sensor=true');
  req.setMethod('GET');
  HTTPResponse resp = http.send(req);
  String json;
  json = resp.getBody().replace('\n', '');

PERL

open(DBFMT, $opt_database . ".fmt") || die "can't open format file:$opt_database",".fmt\n";
$docBegin="DOC";
$docEnd="\/DOC";
$idBegin="DOCNO";
$idEnd="\/DOCNO";
while (<DBFMT>) {
  print STDERR if $debug;
  if (/^\s*TITLE\s*:\s*([^\s]+)/) {
    $title{$1}=1;

HADOOP PIG

pv_by_industry = GROUP profile_view by viewee_industry_id;
pv_avg_by_industry = FOREACH pv_by_industry GENERATE group as viewee_industry_id, AVG(profile_view) AS average_pv;

HIVE

INSERT OVERWRITE TABLE dog_food
SELECT pv.*, u.brand, u.age, f.SKU
FROM page_view pv
JOIN user u ON (pv.id = u.id)
JOIN breed_list f ON (u.id = f.uid)
WHERE pv.date = '2013-02-26';

MapReduce

public static void main(String[] args) throws Exception {
  job.setMapperClass(WordMapper.class);
  job.setInputFormatClass(KeyValueTextInputFormat.class);
  FileInputFormat.addInputPath(job, new Path("/tmp/hadoop-cscarioni/dfs/name/file"));
  FileOutputFormat.setOutputPath(job, new Path("output"));
}

Page 11

Implement a proven path to innovation

Innovate Faster With Big Data (onboard, discover, operationalize)

Minimize Risk of New Technologies (design once, deploy anywhere)

Lower Big Data Project Costs (helps self-fund big data projects)

Page 12

Informatica + Cloudera: Lower Costs

Optimize processing with low cost commodity hardware

Increase productivity up to 5X

Diagram: Transactions, OLTP, OLAP; Social Media, Web Logs; Machine Device, Scientific; Documents and Emails
Access & Ingest; Parse & Prepare; Discover & Profile; Transform & Cleanse; Extract & Deliver
EDW; Data Mart; Traditional Grid

Page 13

Informatica + Cloudera: Minimize Risk

Quickly staff projects with trained data integration experts

Page 14

Informatica + Cloudera: Minimize Risk

Design once and deploy anywhere

Traditional Grid

Deploy On-Premise or in the Cloud

Pushdown to RDBMS or DW Appliance

Page 15

Informatica + Cloudera: Innovate Faster

Onboard and analyze any type of data to gain big data insights

Discover insights faster through rapid development and collaboration

Operationalize big data insights to generate new revenue streams

Diagram: Transactions, OLTP, OLAP; Social Media, Web Logs; Machine Device, Scientific; Documents and Emails; Analytics & Op Dashboards; Mobile Apps; Real-Time Alerts

Page 16

How does Informatica + Cloudera do this?

Page 17

Maximize your return on big data

Diagram: big data integration flow
Data Sources: Transactions, OLTP, OLAP; Social Media, Web Logs; Documents, Email; Machine Device, Scientific
Access & Ingest; Parse & Prepare; Discover & Profile; Transform & Cleanse; Extract & Deliver
Manage (i.e. Security, Performance, Governance, Collaboration)
Operational Systems, Analytical Systems, Reports & Analytics: OLTP, ODS, MDM, Data Warehouse, Data Mart

Page 18

Data Ingestion and Extraction

Diagram: Transactions, OLTP, OLAP; Social Media, Web Logs; Machine Device, Scientific; Documents and Emails; Data Warehouse; Applications; Data Mart
Batch; Replication; Streaming; Archiving; Deliver

Page 19

Integrate All Data: High Performance Data Access

Messaging & Web Services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, Web Services, TIBCO, webMethods

Packaged Applications: JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft, SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel

Relational & Flat Files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC, JDBC

Mainframe & Midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM, Binary Flat Files, Tape Formats…

Unstructured Data & Files: Word, Excel, PDF, StarOffice, WordPerfect, Email (POP, IMAP), HTTP, Flat files, ASCII reports, HTML, RPG, ANSI, LDAP

Industry Standards and XML Standards: FIX, SWIFT, EDI–X12, EDI-Fact, HL7, HIPAA, ebXML, HL7 v3.0, ACORD (AL3, XML), XML, LegalXML, IFX, cXML, NACHA, AST, RosettaNet, Cargo IMP, MVR

SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand

Social Media: Facebook, Twitter, LinkedIn, Kapow, Datasift

MPP Appliances: Teradata, AsterData, EMC/Greenplum, Vertica

Page 20

Informatica ETL Execution on Hadoop

Hive HQL

SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY,
       customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY
FROM (
  SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx
  FROM lineitem
  GROUP BY L_ORDERKEY
) T1
JOIN orders ON (customer.C_ORDERKEY = orders.O_ORDERKEY)

1. Mapping translated and optimized to Hive HQL and User Defined Functions

2. Optimized HQL translated to MapReduce

3. MapReduce and User Defined Functions executed on Cloudera

Diagram: Informatica Data Transformation Engine; UDF; MapReduce; Data Nodes
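Step 2 above, where a grouped HQL query becomes a MapReduce job, can be illustrated with a small in-memory sketch. This is plain Python, not Hadoop or Informatica code: the function names `map_phase`, `shuffle`, `reduce_phase`, and `li_count` are hypothetical, and the job simply counts line items per order key, loosely modeled on the `GROUP BY L_ORDERKEY` clause and the `li_count` column in the query.

```python
from collections import defaultdict


def map_phase(lineitems):
    """Emit (order key, 1) for every line item, as a mapper would."""
    for row in lineitems:
        yield row["L_ORDERKEY"], 1


def shuffle(pairs):
    """Group mapper output by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key, as a reducer computing a count would."""
    return {key: sum(values) for key, values in groups.items()}


def li_count(lineitems):
    """Run the three phases in order and return line-item counts per order."""
    return reduce_phase(shuffle(map_phase(lineitems)))
```

On a real cluster the mapper and reducer run in parallel on the data nodes and the shuffle moves data across the network; the sketch only shows the logical shape of the computation.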

Page 21

Data Profiling & Discovery on Hadoop

Value and Pattern Frequency to Isolate Data Quality Issues

Discover Data Domains & Relationships, Including PII Data
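Value and pattern frequency profiling can be sketched in a few lines of Python. This is an illustrative assumption about what such a profile computes, not Informatica's implementation: the helper names `pattern_of` and `profile` and the shape notation (`9` for a digit, `A` for an ASCII letter) are hypothetical.

```python
import re
from collections import Counter


def pattern_of(value):
    """Reduce a value to its shape: digits become 9, ASCII letters become A."""
    shape = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", shape)


def profile(values):
    """Return (value frequencies, pattern frequencies) for one column."""
    value_freq = Counter(values)
    pattern_freq = Counter(pattern_of(v) for v in values)
    return value_freq, pattern_freq
```

A rare pattern in a column that is otherwise uniform, such as a single `9999A` among many `99999` postal codes, is the kind of outlier that points to a data quality issue.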

Page 22

Informatica + Cloudera Demo

Page 23

Informatica + Cloudera Demo Scenario

• ABC Bank is rolling out a new service to provide daily stock recommendations based on customers' prior transaction history, propensity for risk, and stock popularity

• Input data is
  • Market Data – Bloomberg daily stock price and volume for 2012
  • Customer Transactions (i.e. trades) – Stock purchases over the last 5 years
  • Twitter – Daily number of tweets for each stock symbol for 2012
  • Web Logs – Daily number of stock views for each customer for 2012

• Output is
  • Customer Stock Recommendations – daily stock recommendations for each customer, available in a relational data warehouse

Page 24

• Data Integration on Hadoop

• Data Quality and Profiling on Hadoop

• Data Parsing on Hadoop

• NLP & Entity Extraction on Hadoop

• Replication to Hadoop

• Archiving on Hadoop

Diagram: INFA Clients; Informatica Services; Metadata Repository; Data Access; RDBMS
Connect to HDFS; Connect to Hive
Transactions, OLTP, OLAP; Documents, Email; Social Media, Web Logs; Machine Device, Scientific
Namenode, Job Tracker, DataNode1, DataNode2, DataNode3 (each with HDFS and Map Reduce)

Page 25

Next Steps

Page 26

Transform, Parse, Cleanse, Profile, Match, Archive

1. LOWER COSTS
• OPTIMIZED END-TO-END DATA MANAGEMENT PERFORMANCE ON HADOOP
• RICH PRE-BUILT LIBRARY OF ETL TRANSFORMS, DATA QUALITY RULES, COMPLEX FILE PARSING, & DATA PROFILING ON HADOOP

2. INCREASE PRODUCTIVITY
• UP TO 5X PRODUCTIVITY GAINS WITH NO-CODE VISUAL DEVELOPMENT AND MANAGEMENT

3. ACCELERATE ADOPTION
• 500+ PARTNERS AND 100,000+ TRAINED INFORMATICA DEVELOPERS
• 360+ PARTNERS AND 15,000+ TRAINED ON CLOUDERA ANNUALLY ON 6 CONTINENTS

Page 27

Diagram: Data Governance (Discover; Define; Apply; Measure and Monitor)
Transform, Parse, Cleanse, Profile, Match, Archive

Page 28

What is the plan forward?

• Tomorrow
  • Identify a business opportunity where data can have a significant impact
  • Identify the skills you need to build a team with big data competencies

• 3 months
  • Identify and prioritize the data you need to improve the business (both internal and external)
  • Determine what data to store in Cloudera to lower and control cost
  • Put a business plan together to optimize your DW/BI infrastructure
  • Execute a quick win big data project with demonstrable ROI

• 1 year
  • Extend data governance to include more data and more types of data that impact the business
  • Consider a shared-services model to promote best practices and further lower infrastructure and labor costs

Page 29

Thank You! cloudera.com/clouderasessions