Cloudera Sessions - Clinic 3 - Advanced Steps - Fast-track Development for ETL and Data Integration
Transcript of Cloudera Sessions - Clinic 3 - Advanced Steps - Fast-track Development for ETL and Data Integration
Fast-track Development for Big Data Integration
Why do you need ETL?

The ETL lifecycle: Define & Document → Prototype & Design → Design & Develop → Test → Manage (Monitor, Troubleshoot, Secure & Retain) → Iterate.

The data flow: extract from sources (database, XML, flat files, applications), understand, cleanse, integrate & transform, and load to analytic targets and reports.

Deployment options: ETL, ETL on Grid, ELT, Hadoop, Cloud, Real time, Replication.
Let’s suppose…
• ABC Bank is rolling out a new service to provide daily stock recommendations based on customers’ prior transaction history, propensity for risk, and stock popularity
• Input data:
  • Market Data – Bloomberg daily stock price and volume for one year
  • Customer Transactions (i.e., trades) – stock purchases over the last 5 years
  • Twitter – daily number of tweets for each stock symbol for one year
  • Web Logs – daily number of stock views for each customer for one year
• Output:
  • Customer Stock Recommendations – daily stock recommendations for each customer
If you did this on your own, what would you need to build? What skills are needed?

SQL:
select id, from (select t2.id, t2.item_attr1, t2.total_sales, sum(item_purchase.bought_at) as total_purchase from (select t1.id, t1.item_attr1, sum(item_sales.sold_at) as total_sales from (select id, item_attr1 from item where ... ) as t1 left join item_sales on item.id =

JSON:
<apex:page controller="RestDemoJsonController" tabStyle="Contact"> <apex:sectionHeader title="Google Maps Geocoding" subtitle="REST Demo (JSON)"/> <apex:form > <apex:pageBlock > <apex:pageBlockButtons > <apex:commandButton action="{!submit}" value="Submit" rerender="resultsPanel" status="status"/> </apex:pageBlockButtons> <apex:pageMessages />

JAVA:
// String id, String name public static void GetLatLon_json_Request () { Http http = new Http(); HttpRequest req = new HttpRequest(); req.setEndpoint( 'http://maps.google.com/maps/api/geocode/xml?address=torre+parquecristal+caracas+venezuela&sensor=true'); req.setMethod('GET'); HTTPResponse resp = http.send(req); String json; json = resp.getBody().replace('\n', '');

PERL:
open(DBFMT, $opt_database . ".fmt") || die "can't open format file:$opt_database",".fmt\n"; $docBegin="DOC"; $docEnd="\/DOC"; $idBegin="DOCNO"; $idEnd="\/DOCNO"; while (<DBFMT>) { print STDERR if $debug; if (/^\s*TITLE\s*:\s*([^\s]+)/) { $title{$1}=1;
What if something changes?
Doing this on your own has challenges
• Time-consuming
• Requires specialized skills
• Hard to maintain, difficult to change
• No reuse
There are alternative approaches…
Let’s see how this works with an Informatica Demo
Challenges with traditional infrastructure
• Cannot cost-effectively scale as data volumes grow
• Not designed to support many new data types
• Does not support rapid agile development
• Analysis is not flexible enough to support rapid discovery
Maximize your return on big data

Data sources: Transactions, OLTP, OLAP; Social Media, Web Logs; Documents, Machine Device, Scientific.
Operational systems (OLTP, ODS) and analytical systems (Data Warehouse, Data Marts, MDM) feed reports & analytics.
Pipeline: Access & Ingest → Parse & Prepare → Discover & Profile → Transform & Cleanse → Extract & Deliver.
Manage (i.e., security, performance, governance, collaboration) spans the entire pipeline.
If you did this on your own, what would you need to build? What skills are needed?

SQL:
select id, from (select t2.id, t2.item_attr1, t2.total_sales, sum(item_purchase.bought_at) as total_purchase from (select t1.id, t1.item_attr1, sum(item_sales.sold_at) as total_sales from (select id, item_attr1 from item where ... ) as t1 left join item_sales on item.id =

JSON:
<apex:page controller="RestDemoJsonController" tabStyle="Contact"> <apex:sectionHeader title="Google Maps Geocoding" subtitle="REST Demo (JSON)"/> <apex:form > <apex:pageBlock > <apex:pageBlockButtons > <apex:commandButton action="{!submit}" value="Submit" rerender="resultsPanel" status="status"/> </apex:pageBlockButtons> <apex:pageMessages />

JAVA:
// String id, String name public static void GetLatLon_json_Request () { Http http = new Http(); HttpRequest req = new HttpRequest(); req.setEndpoint( 'http://maps.google.com/maps/api/geocode/xml?address=torre+parquecristal+caracas+venezuela&sensor=true'); req.setMethod('GET'); HTTPResponse resp = http.send(req); String json; json = resp.getBody().replace('\n', '');

PERL:
open(DBFMT, $opt_database . ".fmt") || die "can't open format file:$opt_database",".fmt\n"; $docBegin="DOC"; $docEnd="\/DOC"; $idBegin="DOCNO"; $idEnd="\/DOCNO"; while (<DBFMT>) { print STDERR if $debug; if (/^\s*TITLE\s*:\s*([^\s]+)/) { $title{$1}=1;

HADOOP PIG:
pv_by_industry = GROUP profile_view BY viewee_industry_id; pv_avg_by_industry = FOREACH pv_by_industry GENERATE group AS viewee_industry_id, AVG(profile_view) AS average_pv;

HIVE:
INSERT OVERWRITE TABLE dog_food SELECT pv.*, u.brand, u.age, f.SKU FROM page_view pv JOIN user u ON (pv.id = u.id) JOIN breed_list f ON (u.id = f.uid) WHERE pv.date = '2013-02-26';

MAPREDUCE:
public static void main(String[] args) throws Exception { Job job = Job.getInstance(new Configuration(), "example"); job.setMapperClass(WordMapper.class); job.setInputFormatClass(KeyValueTextInputFormat.class); FileInputFormat.addInputPath(job, new Path("/tmp/hadoop-cscarioni/dfs/name/file")); FileOutputFormat.setOutputPath(job, new Path("output")); System.exit(job.waitForCompletion(true) ? 0 : 1); }
Implement a proven path to innovation
Innovate Faster With Big Data (onboard, discover, operationalize)
Minimize Risk of New Technologies (design once, deploy anywhere)
Lower Big Data Project Costs (helps self-fund big data projects)
Informatica + Cloudera: Lower Costs
Data sources: Transactions, OLTP, OLAP; Social Media, Web Logs; Machine Device, Scientific; Documents and Emails. Targets: EDW and Data Marts.
• Optimize processing with low-cost commodity hardware (vs. a traditional grid)
• Increase productivity up to 5X
Pipeline: Access & Ingest → Parse & Prepare → Discover & Profile → Transform & Cleanse → Extract & Deliver.
Informatica + Cloudera: Minimize Risk
Quickly staff projects with trained data integration experts
Informatica + Cloudera: Minimize Risk
Design once and deploy anywhere:
• Traditional grid
• On-premise or in the Cloud
• Pushdown to RDBMS or DW appliance
Informatica + Cloudera: Innovate Faster
Onboard any data (Transactions, OLTP, OLAP; Social Media, Web Logs; Machine Device, Scientific; Documents and Emails) and deliver it to analytics & operational dashboards, mobile apps, and real-time alerts.
• Onboard and analyze any type of data to gain big data insights
• Discover insights faster through rapid development and collaboration
• Operationalize big data insights to generate new revenue streams
How does Informatica + Cloudera do this?
Maximize your return on big data

Data sources: Transactions, OLTP, OLAP; Social Media, Web Logs; Documents, Machine Device, Scientific.
Operational systems (OLTP, ODS) and analytical systems (Data Warehouse, Data Marts, MDM) feed reports & analytics.
Pipeline: Access & Ingest → Parse & Prepare → Discover & Profile → Transform & Cleanse → Extract & Deliver.
Manage (i.e., security, performance, governance, collaboration) spans the entire pipeline.
Data Ingestion and Extraction
Sources (Transactions, OLTP, OLAP; Social Media, Web Logs; Machine Device, Scientific; Documents and Emails) are delivered to the Data Warehouse, applications, and Data Marts via four modes: Batch, Replication, Streaming, and Archiving.
Integrate All Data: High-Performance Data Access
• Messaging & Web Services: WebSphere MQ, JMS, MSMQ, SAP NetWeaver XI, Web Services, TIBCO, webMethods
• Packaged Applications: JD Edwards, Lotus Notes, Oracle E-Business, PeopleSoft, SAP NetWeaver, SAP NetWeaver BI, SAS, Siebel
• Relational & Flat Files: Oracle, DB2 UDB, DB2/400, SQL Server, Sybase, Informix, Teradata, Netezza, ODBC, JDBC, flat files, ASCII reports, HTML, RPG, ANSI, LDAP
• Mainframe & Midrange: ADABAS, Datacom, DB2, IDMS, IMS, VSAM, C-ISAM, binary flat files, tape formats…
• Unstructured Data & Files: Word, Excel, PDF, StarOffice, WordPerfect, Email (POP, IMAP), HTTP
• Industry Standards: FIX, SWIFT, EDI–X12, EDI-Fact, HL7, HIPAA, NACHA, AST, Cargo IMP, MVR
• XML Standards: ebXML, HL7 v3.0, ACORD (AL3, XML), XML, LegalXML, IFX, cXML, RosettaNet
• SaaS/BPO: Salesforce CRM, Force.com, RightNow, NetSuite, ADP, Hewitt, SAP By Design, Oracle OnDemand
• Social Media: Facebook, Twitter, LinkedIn, Kapow, Datasift
• MPP Appliances: Teradata, AsterData, EMC/Greenplum, Vertica
Informatica ETL Execution on Hadoop
Example of generated Hive HQL:
SELECT T1.ORDERKEY1 AS ORDERKEY2, T1.li_count, orders.O_CUSTKEY AS CUSTKEY, customer.C_NAME, customer.C_NATIONKEY, nation.N_NAME, nation.N_REGIONKEY FROM ( SELECT TRANSFORM (L_Orderkey.id) USING CustomInfaTx FROM lineitem GROUP BY L_ORDERKEY ) T1 JOIN orders ON (customer.C_ORDERKEY = orders.O_ORDERKEY)
1. The mapping is translated and optimized to Hive HQL and user-defined functions (UDFs)
2. The optimized HQL is translated to MapReduce
3. MapReduce and the UDFs (running the Informatica data transformation engine) are executed on the Cloudera data nodes
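To make the idea of "mapping translated to HQL" concrete, here is a minimal, purely illustrative sketch: a toy mapping specification compiled to a Hive HQL string. The spec format and the generate_hql() function are hypothetical teaching devices, not Informatica's actual translation mechanism or API.

```python
# Hypothetical sketch only: compile a tiny declarative "mapping" spec into
# a Hive HQL string, loosely mimicking how a visual mapping becomes SQL.

def generate_hql(mapping):
    """Render a simple source -> filter -> aggregate mapping as Hive HQL."""
    cols = ", ".join(mapping["columns"])
    hql = f"SELECT {cols} FROM {mapping['source']}"
    if mapping.get("filter"):
        hql += f" WHERE {mapping['filter']}"          # row filter transform
    if mapping.get("group_by"):
        hql += " GROUP BY " + ", ".join(mapping["group_by"])  # aggregator
    return hql

# Example mapping: count page views per id for one day (names are made up).
mapping = {
    "source": "page_view",
    "columns": ["id", "count(*) AS views"],
    "filter": "dt = '2013-02-26'",
    "group_by": ["id"],
}
print(generate_hql(mapping))
# -> SELECT id, count(*) AS views FROM page_view WHERE dt = '2013-02-26' GROUP BY id
```

In the real product the generated HQL (plus UDFs) is then handed to Hive, which compiles it to MapReduce jobs; the sketch stops at the string-generation step.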
Data Profiling & Discovery on Hadoop
• Value and pattern frequency to isolate data quality issues
• Discover data domains & relationships, including PII data
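Value and pattern frequency can be sketched in a few lines. The following is an illustrative toy, not Informatica's profiler: it maps each value to a "pattern" (letters become A, digits become 9, a common profiling convention) and counts both raw values and patterns, so outlier patterns surface likely data quality issues.

```python
# Illustrative sketch of value/pattern frequency profiling (hypothetical,
# not the product's implementation).
from collections import Counter

def pattern(value: str) -> str:
    """Abstract a value: digits -> '9', letters -> 'A', keep punctuation."""
    return "".join("9" if c.isdigit() else "A" if c.isalpha() else c
                   for c in value)

def profile(column):
    """Return (value frequencies, pattern frequencies) for a column."""
    values = Counter(column)
    patterns = Counter(pattern(v) for v in column)
    return values, patterns

phones = ["555-1212", "555-3434", "55X-1212", "555-9876"]
values, patterns = profile(phones)
# The rare pattern '99A-9999' (one row) flags a likely bad phone number.
print(patterns.most_common())
```

Running the same idea as a MapReduce or Hive job is what moves this kind of profiling onto the Hadoop cluster itself.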
Informatica + Cloudera Demo
Informatica + Cloudera Demo Scenario
• ABC Bank is rolling out a new service to provide daily stock recommendations based on customers’ prior transaction history, propensity for risk, and stock popularity
• Input data:
  • Market Data – Bloomberg daily stock price and volume for 2012
  • Customer Transactions (i.e., trades) – stock purchases over the last 5 years
  • Twitter – daily number of tweets for each stock symbol for 2012
  • Web Logs – daily number of stock views for each customer for 2012
• Output:
  • Customer Stock Recommendations – daily stock recommendations for each customer, available in a relational data warehouse
• Data Integration on Hadoop
• Data Quality and Profiling on Hadoop
• Data Parsing on Hadoop
• NLP & Entity Extraction on Hadoop
• Replication to Hadoop
• Archiving on Hadoop
Connect to HDFS and Hive
Sources (Transactions, OLTP, OLAP; Documents; Social Media, Web Logs; Machine Device, Scientific) flow into the cluster: a NameNode/Job Tracker plus DataNodes 1–3, each running HDFS and MapReduce. INFA clients and Informatica services (backed by a metadata repository) connect to HDFS and to Hive, with data access to the RDBMS.
Next Steps
Transform • Parse • Cleanse • Profile • Match • Archive
1. LOWER COSTS: optimized end-to-end data management performance on Hadoop; rich pre-built library of ETL transforms, data quality rules, complex file parsing, and data profiling on Hadoop
2. INCREASE PRODUCTIVITY: up to 5X productivity gains with no-code visual development and management
3. ACCELERATE ADOPTION: 500+ partners and 100,000+ trained Informatica developers; 360+ partners and 15,000+ trained on Cloudera annually on 6 continents
Apply Data Governance
Lifecycle: Discover → Define → Apply → Measure and Monitor.
Capabilities: Transform, Parse, Cleanse, Profile, Match, Archive.
What is the plan forward?
• Tomorrow:
  • Identify a business opportunity where data can have a significant impact
  • Identify the skills you need to build a team with big data competencies
• 3 months:
  • Identify and prioritize the data you need to improve the business (both internal and external)
  • Determine what data to store in Cloudera to lower and control cost
  • Put a business plan together to optimize your DW/BI infrastructure
  • Execute a quick-win big data project with demonstrable ROI
• 1 year:
  • Extend data governance to include more data and more types of data that impact the business
  • Consider a shared-services model to promote best practices and further lower infrastructure and labor costs
Thank You! cloudera.com/clouderasessions