Big Data meets Big Integration
1 © Copyright 2012 EMC Corporation. All rights reserved.
EMC Greenplum
Wolfgang Disselhoff, Sr. Technology Architect, Greenplum
André Münger, Sr. Account Manager, Greenplum
Big Data meets Big Integration
GREENPLUM DATABASE
Industry-Leading Massively Parallel Processing (MPP) Performance
Overview MPP Systems / Must-Haves
High Availability, Fault tolerance, Failover, Backup and Recovery
Simpler Architecture needed
Decreased maintenance
Performance improvements (by factors, for both load and end-user workloads)
Highly improved cost/performance ratio
Linear, "unlimited" scalability
Fully supporting semi- and unstructured data (e.g. Hadoop)
Rich Ecosystem (Interfaces / Partner / Migration)
Company stability / strategy
Extreme Performance for Analytics
• Optimized for BI and analytics
– Deep integration with statistical packages
– High performance parallel implementations
• Simple and automatic
– Just load and query like any database
– Tables are automatically distributed across nodes
• Extremely scalable
– MPP shared-nothing architecture
– All nodes can scan and process in parallel
– Linear scalability by adding nodes
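The claim that tables are automatically distributed across nodes can be illustrated with a minimal sketch of hash-based row placement. This is an assumption-laden toy, not Greenplum's implementation: the hash function (`zlib.crc32`), the key format, and the helper names `segment_for` and `distribute` are all hypothetical choices made for illustration.

```python
import zlib
from collections import Counter


def segment_for(key: str, num_segments: int) -> int:
    """Map a distribution-key value to a segment index.

    Illustrative only: CRC32 is a stand-in, not the hash Greenplum uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_segments


def distribute(keys, num_segments: int) -> Counter:
    """Count how many rows would land on each segment."""
    return Counter(segment_for(k, num_segments) for k in keys)


if __name__ == "__main__":
    # Hypothetical distribution-key values.
    rows = [f"customer-{i}" for i in range(100_000)]
    for n in (4, 8):
        counts = distribute(rows, n)
        print(n, "segments:", sorted(counts.items()))
```

Because every row is assigned to exactly one segment by its key, each segment can scan its own share independently, which is the property the shared-nothing bullets above rely on.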
GREENPLUM DATABASE
Massively Parallel Processing And Linear Performance Scalability
Master Server: query planning & dispatch (primary server, plus hot failover). Clients see a single database.
Segment Servers: query processing & data storage. Queries are distributed across all available resources.
Network Interconnect: shared nothing, massively parallel processing means no bottlenecks and linear scalability.
Data Sources (loading, streaming, etc.): external files, URLs, Hadoop (HDFS), web services (including from other DBs), O/S pipes (including from other DBs). Data loading also takes advantage of the MPP architecture.
Hadoop MapReduce: Greenplum handles structured, semi-structured and unstructured data.
Clients: standard business intelligence and analytical tools (SQL BI tools, analytical tools) and "new gen" tools: R, Rattle, Perl, Python, PL/Java, etc.
High Availability
Master Server Data Protection
• Replicated transaction logs for server failure
• Optional RAID protection for drive failures
• Upon server failure: standby server activated, administrator alerted
• Orchestrated failover
Segment Server Data Protection
• Mirrored segments for server failures
• Optional RAID protection for drive failures
• Upon server failure: mirrored segments take over with no loss of service
• Fast online differential recovery
GREENPLUM DATABASE
Migration using Informatica
BI Tools
Reporting
Analytics
Data Integration
Oracle Environment
EMC GP DCA
Oracle Data Warehouse
• Complete set of migration best practices
• Simple 3rd-party reporting and analytic tool integration
• Greenplum training specifically for Oracle DBAs
• High speed data export, transformation, and load
• Support for proprietary Oracle SQL functions for continuing use of Oracle DB talent
GREENPLUM HD
Delivering Enterprise-Ready Apache Hadoop
Greenplum HD Architecture
GREENPLUM COMMAND CENTER
Pluggable Storage Layer (HDFS API)
MapReduce Layer
Hadoop Tools (Pig, Hive, HBase, Zookeeper, Mahout, etc…)
Apache HDFS
Greenplum Chorus
Isilon OneFS
GREENPLUM HD
GREENPLUM CHORUS
The World’s First Agile Analytics Productivity Platform
Agile Analytics
Faster, Easier with Chorus
Project Workspaces
Data Analysis
Publish and Iterate
Explore the Data
Collaboration
GREENPLUM CHORUS
Collaborative Analytics
Iterate faster for accelerated insights with real-time social collaboration
Make projects more transparent
Collaborate within projects, share information across teams
GREENPLUM CHORUS
The Power of Data Co-Processing
THE ANSWER
MACHINE DATA IN. DECISIONS OUT.
Greenplum Modular
Data Computing Appliance
Start With a High Speed Interconnect…
2 GPDB Master Servers
2 10GE Switches
Administrative Switch
Functional Module
Functional Module
Functional Module
Functional Module
GREENPLUM DCA
Greenplum Data Computing Appliance: A Revolutionary Modular Architecture
Greenplum Database Standard Module
• 9TB capacity (uncompressed)
• Each server contains: 2 sockets / 12 cores, 48GB memory, 12x 600GB storage
Greenplum Database High Capacity Module
• 31TB capacity (uncompressed)
• Each server contains: 2 sockets / 12 cores, 48GB memory, 12x 2TB storage
Greenplum HD Module
• 28TB capacity (3 copies, uncompressed)
• Each server contains: 2 sockets / 12 cores, 48GB memory, 12x 2TB storage
Greenplum Data Integration Accelerator (DIA) Module
• 70TB capacity
• Each server contains: 2 sockets / 12 cores, 48GB memory, 12x 2TB storage
Rack diagram labels: HD, DIA, GPDB, GPDB
GREENPLUM DCA
Scale to Multiple Racks In Granular Quarter Rack Increments
1st Rack: one Greenplum Database Module (required) plus functional modules, added in ¼ rack increments
Aggregation Rack: functional modules, added in ¼ rack increments
Each functional module is a Greenplum Database Module, a Greenplum DIA Module, or a Greenplum HD Module.
Seamless Infrastructure Integration
EMC Data Domain: Efficient Backup & Restore
EMC VMAX or VNX: SAN Mirror For Advanced Storage Management
Isilon: Scale Out Storage For Big Data Staging
EMC VMAX SRDF / EMC Data Domain Replication: For Disaster Recovery
GREENPLUM DCA
Examples using Informatica PowerCenter
PowerConnect for Greenplum
Greenplum and Informatica (Infa): Production Mapping
About 600 million records are read from files with a degree of parallelism of 20. The results are written to 6 database tables with a degree of parallelism of 20, giving 6 * 20 = 120 parallel data streams.
Read from the source: 560,775,979 records
Written to Greenplum: 558,373,471 records
Errors: 0
Processing time: ~45 min
Of which Informatica: 43 min
Of which Greenplum: 2'35" (from the loader logs), approx. 3,602,409 rec/s
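The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below only re-derives the stream count and the Greenplum load rate from the numbers quoted on the slide; the function and variable names are illustrative, and the slide itself does not explain the difference between records read and records written.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes'seconds" duration to seconds."""
    return minutes * 60 + seconds


# Values quoted on the slide.
TARGET_TABLES = 6
PARALLELISM = 20
RECORDS_READ = 560_775_979
RECORDS_WRITTEN = 558_373_471
GREENPLUM_LOAD_SECONDS = to_seconds(2, 35)  # 2'35" from the loader logs

parallel_streams = TARGET_TABLES * PARALLELISM
load_rate = RECORDS_WRITTEN / GREENPLUM_LOAD_SECONDS
not_written = RECORDS_READ - RECORDS_WRITTEN

if __name__ == "__main__":
    print("parallel streams:", parallel_streams)
    print("Greenplum load rate (rec/s):", int(load_rate))
    print("records read but not written:", not_written)
```

The load rate is computed against the 2'35" that Greenplum itself spent loading, not against the ~45 minutes of end-to-end processing.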
Reading data from Greenplum:
Production mapping "Step 2": reads the data written in step 1 and processes it further. Again 20-way parallel.
A look at the session:
Runtime for ~360 million records: 7'15" at 20 * ~45,000 rec/s = 900,000 rec/s
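As a rough plausibility check, the nominal rate on the slide can be compared with the rate implied by the runtime. This is a sketch using the slide's rounded inputs (~360 million records, ~45,000 rec/s per stream), so the two results are only expected to agree approximately; the variable names are illustrative.

```python
# Rounded values quoted on the slide.
RECORDS = 360_000_000          # "~360 million"
RUNTIME_SECONDS = 7 * 60 + 15  # 7'15"
STREAMS = 20
RATE_PER_STREAM = 45_000       # "~45,000 rec/s"

# Nominal aggregate rate as stated on the slide.
nominal_rate = STREAMS * RATE_PER_STREAM

# Average rate implied by record count and runtime.
implied_rate = RECORDS / RUNTIME_SECONDS

# Runtime the nominal rate would need for the same record count.
nominal_runtime = RECORDS / nominal_rate

if __name__ == "__main__":
    print("nominal rate (rec/s):", nominal_rate)
    print("implied average rate (rec/s):", round(implied_rate))
    print("runtime at nominal rate (s):", nominal_runtime)
```

The implied average comes out somewhat below the nominal 900,000 rec/s, which is consistent with both inputs being rounded and with a session that does not run at peak rate for its whole duration.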
A look at the Greenplum monitor:
Throughput of the 10Gb network: 10Gb corresponds to
10,737,418,240 bit/s
1,342,177,280 byte/s
1,310,720 kB/s
1,280 MB/s gross rate usable for protocol + data
This means that at about 15% utilization of the segment servers, 90% utilization of a 10Gb interface is reached. 85% of the database's compute capacity remains free for analytics!
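The conversion chain on this slide can be reproduced as follows. The sketch follows the slide's convention of binary prefixes (1 Gb = 2^30 bit, 1 kB = 1024 byte); network link speeds are usually quoted with decimal prefixes, under which 10 Gbit/s would be 1,250,000,000 byte/s. The function name `gross_rates` is hypothetical.

```python
def gross_rates(gigabits: int) -> dict:
    """Convert a link speed to bit/s, byte/s, kB/s and MB/s.

    Uses binary prefixes, as the slide does (an assumption about the
    slide's convention, not a statement about Ethernet line rates).
    """
    bits = gigabits * 2**30
    bytes_per_s = bits // 8
    kb_per_s = bytes_per_s // 1024
    mb_per_s = kb_per_s // 1024
    return {
        "bit/s": bits,
        "byte/s": bytes_per_s,
        "kB/s": kb_per_s,
        "MB/s": mb_per_s,
    }


rates = gross_rates(10)
# 90% utilization of the interface, in MB/s (integer arithmetic).
utilized_mb_per_s = rates["MB/s"] * 90 // 100

if __name__ == "__main__":
    for unit, value in rates.items():
        print(f"{value:,} {unit}")
    print(f"90% utilization: {utilized_mb_per_s:,} MB/s")
```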
Test results
Informatica drives a DIA ETL server with 2 CPUs and 12 cores to > 90% utilization. This is made possible by the throughput of the 10Gb interfaces used and by the database performance.
Pushdown Technology
Mapping: two data sets are joined, then an expression and an aggregator are executed
Message: Optimizer generated SQL statement for target [ev.Test_Install]:
INSERT INTO dev.test_install_pd (Integer_FLD, Bigint, Date_FLD, varchar_FLD)
SELECT
  CAST(COUNT(test_install1.integer_fld) AS FLOAT),
  CAST(COUNT(test_install1.bigint) AS FLOAT),
  MAX(test_install2.date_fld),
  MIN(UPPER(test_install2.varchar_fld))
FROM (test_install test_install1
  INNER JOIN test_install test_install2
  ON ((((test_install2.integer_fld = test_install1.integer_fld)
    AND (test_install2.bigint = test_install1.bigint))
    AND (test_install2.date_fld = test_install1.date_fld))
    AND (test_install2.varchar_fld = test_install1.varchar_fld)))
GROUP BY test_install1.integer_fld, test_install1.bigint, test_install2.date_fld, UPPER(test_install2.varchar_fld)
Conclusion on "PushDown": Informatica can hand the logic over to the Greenplum database completely, i.e. no data transfer takes place between the DB and the ETL tool. The database always works massively parallel, independent of the mapping/session.
Questions?
Thank you!