Big Data meets Big Integration


Page 1

1 © Copyright 2012 EMC Corporation. All rights reserved.

EMC Greenplum

Wolfgang Disselhoff, Sr. Technology Architect, Greenplum

André Münger, Sr. Account Manager, Greenplum

Big Data meets Big Integration

Page 3

GREENPLUM DATABASE

Industry-Leading Massively Parallel Processing (MPP) Performance

Page 4

Overview MPP Systems / Must-Haves

High Availability, Fault tolerance, Failover, Backup and Recovery

Simpler Architecture needed

Decreased maintenance

Performance improvements (by factors, for both load and end users)

Highly improved cost/performance ratio

Linear, "unlimited" scalability

Fully supporting semi- and unstructured data (e.g. Hadoop)

Rich Ecosystem (Interfaces / Partner / Migration)

Company stability / strategy

Page 5

Extreme Performance for Analytics

• Optimized for BI and analytics

– Deep integration with statistical packages

– High performance parallel implementations

• Simple and automatic

– Just load and query like any database

– Tables are automatically distributed across nodes

• Extremely scalable

– MPP shared-nothing architecture

– All nodes can scan and process in parallel

– Linear scalability by adding nodes
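The automatic distribution of tables across nodes can be pictured with a small sketch. It assumes a hash-based distribution on a key column; the function names, the `customer_id` column, and the use of CRC32 as the hash are illustrative assumptions, not Greenplum's actual implementation.

```python
import zlib


def segment_for(key, num_segments):
    """Map a distribution-key value to a segment index.

    CRC32 is only a stand-in hash for illustration; it is not the
    hash function Greenplum uses internally.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_segments


def distribute(rows, key_column, num_segments):
    """Group rows by the segment that would store them."""
    segments = {i: [] for i in range(num_segments)}
    for row in rows:
        segments[segment_for(row[key_column], num_segments)].append(row)
    return segments


# Hypothetical table with 1,000 rows spread over 4 segments.
rows = [{"customer_id": i, "amount": i * 10} for i in range(1000)]
segments = distribute(rows, "customer_id", num_segments=4)

for seg_id, part in segments.items():
    print(f"segment {seg_id}: {len(part)} rows")
```

Because every row lands on exactly one segment, each segment can scan its own share in parallel, which is the property the bullets above rely on.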

Page 6

Massively Parallel Processing and Linear Performance Scalability

[Architecture diagram; labels and callouts:]

• Standard Business Intelligence and analytical tools

– SQL BI tools

– Analytical tools and "new gen" tools: R, rattle, Perl, Python, PL/Java, etc.

– Clients see a single database

• Master Server: query planning & dispatch

– Primary server, plus hot failover

• Network Interconnect

• Segment Servers: query processing & data storage

– Queries distributed across all available resources

– Shared Nothing, Massively Parallel Processing means no bottlenecks and linear scalability

• Hadoop MapReduce

– Greenplum handles structured, semi-structured and unstructured data

• Data Sources: loading, streaming, etc.

– External files, URLs, Hadoop (HDFS), web services (including from other DBs), O/S pipes (including from other DBs)

– Data loading also takes advantage of MPP architecture

Page 7

High Availability

[Diagram: active Master and standby Master above four Segments]

Master Server Data Protection

• Replicated transaction logs for server failure

• Optional RAID protection for drive failures

• Upon server failure

– Standby server activated

– Administrator alerted

– Orchestrated failover

Segment Server Data Protection

• Mirrored segments for server failures

• Optional RAID protection for drive failures

• Upon server failure

– Mirrored segments take over with no loss of service

– Fast online differential recovery

Page 8

Migration using Informatica

[Diagram labels: BI Tools, Reporting, Analytics, Data Integration; Oracle Environment (Oracle Data Warehouse); EMC GP DCA]

• Complete set of migration best practices

• Simple 3rd-party reporting and analytic tool integration

• Greenplum training specifically for Oracle DBAs

• High speed data export, transformation, and load

• Support for proprietary Oracle SQL functions for continuing use of Oracle DB talent

Page 9

GREENPLUM HD

Delivering Enterprise-Ready Apache Hadoop

Page 10

Greenplum HD Architecture

Greenplum Command Center

Pluggable Storage Layer (HDFS API)

MapReduce Layer

Hadoop Tools (Pig, Hive, HBase, Zookeeper, Mahout, etc…)

Apache HDFS

Greenplum Chorus

Isilon OneFS

Page 11

GREENPLUM CHORUS

The World’s First Agile Analytics Productivity Platform

Page 12

Agile Analytics

Faster, Easier with Chorus

Project Workspaces

Data Analysis

Publish and Iterate

Explore the Data

Collaboration

Page 13

Collaborative Analytics

Iterate faster for accelerated insights with real-time social collaboration

Make projects more transparent

Collaborate within projects, share information across teams

Page 14

The Power of Data Co-Processing

Page 15

THE ANSWER

MACHINE DATA IN. DECISIONS OUT.

Greenplum Modular Data Computing Appliance

Page 16

Start With a High Speed Interconnect…

2 GPDB Master Servers

2 10GE Switches

Administrative Switch

4 Functional Modules

GREENPLUM DCA

Page 17

Greenplum Data Computing Appliance: A Revolutionary Modular Architecture

Greenplum Database Standard Module

• 9TB capacity (uncompressed)

• Each server contains: 2 sockets / 12 cores, 48GB memory, 12x 600GB storage

Greenplum Database High Capacity Module

• 31TB capacity (uncompressed)

• Each server contains: 2 sockets / 12 cores, 48GB memory, 12x 2TB storage

Greenplum HD Module

• 28TB capacity (3 copies, uncompressed)

• Each server contains: 2 sockets / 12 cores, 48GB memory, 12x 2TB storage

Greenplum Data Integration Accelerator (DIA) Module

• 70TB capacity

• Each server contains: 2 sockets / 12 cores, 48GB memory, 12x 2TB storage

[Rack diagram labels: HD, DIA, GPDB, GPDB]

Page 18

Scale to Multiple Racks in Granular Quarter-Rack Increments

[Diagram: each rack is shown as a stack of Functional Modules]

1st Rack

• Greenplum Database Module (required)

• Add ¼-rack increments: Greenplum Database Modules, or Greenplum DIA Module, or Greenplum HD Module

Aggregation Rack

• Add ¼-rack increments: Greenplum Database Modules, or Greenplum DIA Module, or Greenplum HD Module

Page 19

Seamless Infrastructure Integration

• EMC Data Domain: efficient backup & restore

• EMC VMAX or VNX SAN Mirror: for advanced storage management

• Isilon Scale Out Storage: for Big Data staging

• EMC VMAX SRDF / EMC Data Domain Replication: for disaster recovery

Page 21

Examples using Informatica PowerCenter

PowerConnect for Greenplum

Page 22

Greenplum and Informatica: Production Mapping

About 600 million records are read from files with a degree of parallelism of 20. The results are written to 6 database tables with a degree of parallelism of 20, i.e. 6 * 20 = 120 parallel data streams.

Read from the source: 560,775,979 records

Written to Greenplum: 558,373,471 records

Errors: 0

Processing time: ~45 min

– Of which Informatica: 43 min

– Of which Greenplum: 2'35" (from the loader logs)

Approx. 3,602,409 rec/s
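The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes that the quoted rate of about 3,602,409 rec/s is the number of records written to Greenplum divided by the Greenplum loader time of 2'35"; the slide does not state this explicitly, but the numbers agree. The variable names are illustrative.

```python
# Figures taken from the slide.
SOURCE_RECORDS = 560_775_979      # records read from the source files
WRITTEN_RECORDS = 558_373_471     # records written to Greenplum
TARGET_TABLES = 6
WRITE_PARALLELISM = 20
LOADER_SECONDS = 2 * 60 + 35      # Greenplum time from the loader logs: 2'35"

# 6 target tables, each written with parallelism 20.
parallel_streams = TARGET_TABLES * WRITE_PARALLELISM

# Assumed derivation of the quoted rate: written records / loader time.
loader_rate = WRITTEN_RECORDS / LOADER_SECONDS

# Records read but not written; the slide reports 0 errors and does not
# explain the difference.
not_written = SOURCE_RECORDS - WRITTEN_RECORDS

print(f"parallel streams: {parallel_streams}")
print(f"loader rate: {loader_rate:,.0f} rec/s")
print(f"read but not written: {not_written:,}")
```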

Page 23

Reading data from Greenplum:

Production mapping "Step 2": reads the data written in step 1 and processes it further. Again 20-way parallel.

Page 24

A look at the session:

Runtime for ~360 million records: 7'15" at 20 * ~45,000 rec/s = 900,000 rec/s
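As a rough plausibility check, the nominal rate (20 streams at about 45,000 rec/s each) can be compared with the average implied by the runtime. Both inputs are approximate on the slide, so the sketch below only shows that the two values are of the same order; the variable names are illustrative.

```python
# Approximate figures from the slide.
RECORDS = 360_000_000             # ~360 million records
RUNTIME_SECONDS = 7 * 60 + 15     # 7'15"
STREAMS = 20
RATE_PER_STREAM = 45_000          # ~45,000 rec/s per stream

# Nominal aggregate rate as stated on the slide.
nominal_rate = STREAMS * RATE_PER_STREAM

# Average rate implied by record count and runtime.
average_rate = RECORDS / RUNTIME_SECONDS

print(f"nominal rate: {nominal_rate:,} rec/s")
print(f"average over the run: {average_rate:,.0f} rec/s")
```

The average over the whole run comes out somewhat below the nominal 900,000 rec/s, which is consistent with the "~" on both inputs.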

Page 25

A look at the Greenplum monitor:

Throughput of the 10Gb network: 10Gb (counted here as 10 × 2^30 bit) corresponds to

10,737,418,240 bit/s

1,342,177,280 Byte/s

1,310,720 kB/s

1,280 MB/s gross rate usable for protocol + data

Thus, at about 15% utilization of the segment servers, 90% utilization of a 10Gb interface is reached. 85% of the DB compute capacity remains free for analytics!
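The conversion chain on this slide can be reproduced as follows. The slide counts 10Gb with binary prefixes (10 × 2^30 bit); the sketch follows that convention and, for comparison, also shows the decimal reading (10 × 10^9 bit/s), which is how Ethernet line rates are normally specified. The variable names are illustrative.

```python
# Binary-prefix reading used on the slide: 10 Gb = 10 * 2**30 bit.
bits_per_second = 10 * 2**30
bytes_per_second = bits_per_second // 8
kib_per_second = bytes_per_second // 1024
mib_per_second = kib_per_second // 1024

# The slide reports about 90% utilization of one 10Gb interface.
used_mib_per_second = 0.9 * mib_per_second

# Decimal reading for comparison: 10 Gbit/s = 10 * 10**9 bit/s.
decimal_bytes_per_second = 10 * 10**9 // 8

print(f"{bits_per_second:,} bit/s")
print(f"{bytes_per_second:,} Byte/s")
print(f"{kib_per_second:,} kB/s")
print(f"{mib_per_second:,} MB/s gross")
print(f"{used_mib_per_second:,.0f} MB/s at 90% utilization")
print(f"{decimal_bytes_per_second:,} Byte/s with decimal prefixes")
```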

Page 26

Test results

Informatica drives a DIA ETL server with 2 CPUs and 12 cores to > 90% utilization. This is made possible by the throughput over the 10Gb interfaces used and by the DB performance.

Page 27

Pushdown Technology

Mapping: two data sets are joined, then an expression and an aggregator are executed

Message: Optimizer generated SQL statement for target [ev.Test_Install]:

INSERT INTO dev.test_install_pd ( Integer_FLD, Bigint, Date_FLD, varchar_FLD )
SELECT
    CAST(COUNT(test_install1.integer_fld) AS FLOAT),
    CAST(COUNT(test_install1.bigint) AS FLOAT),
    MAX(test_install2.date_fld),
    MIN(UPPER(test_install2.varchar_fld))
FROM (test_install test_install1
    INNER JOIN test_install test_install2
    ON ((( (test_install2.integer_fld = test_install1.integer_fld)
        AND (test_install2.bigint = test_install1.bigint))
        AND (test_install2.date_fld = test_install1.date_fld))
        AND (test_install2.varchar_fld = test_install1.varchar_fld)))
GROUP BY
    test_install1.integer_fld,
    test_install1.bigint,
    test_install2.date_fld,
    UPPER(test_install2.varchar_fld)

Conclusion on "PushDown": Informatica can hand the logic over to the Greenplum database completely, i.e. no data transfer takes place between DB and ETL. The database always works massively parallel, independent of the mapping/session.
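The pushdown idea can be tried out on a small scale. The sketch below uses Python's built-in `sqlite3` module as a stand-in database, which is purely an assumption for illustration; it is not Greenplum and not what Informatica generates. The tables are simplified versions of `test_install` and `test_install_pd` with hypothetical result column names. The point is that a single `INSERT ... SELECT` performs join, expression, and aggregation inside the database, so no rows travel to the client.

```python
import sqlite3

# In-memory stand-in database; Greenplum would run the same pattern in parallel.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_install (integer_fld INTEGER, date_fld TEXT, varchar_fld TEXT)"
)
conn.execute(
    "CREATE TABLE test_install_pd (cnt REAL, max_date TEXT, min_name TEXT)"
)
conn.executemany(
    "INSERT INTO test_install VALUES (?, ?, ?)",
    [
        (1, "2012-01-01", "alpha"),
        (1, "2012-01-01", "alpha"),
        (2, "2012-02-01", "beta"),
    ],
)

# Pushdown: join, expression (UPPER) and aggregation all run inside the
# database in one statement; the client never fetches the source rows.
conn.execute(
    """
    INSERT INTO test_install_pd (cnt, max_date, min_name)
    SELECT CAST(COUNT(t1.integer_fld) AS REAL),
           MAX(t2.date_fld),
           MIN(UPPER(t2.varchar_fld))
    FROM test_install t1
    INNER JOIN test_install t2
        ON t2.integer_fld = t1.integer_fld
       AND t2.date_fld = t1.date_fld
       AND t2.varchar_fld = t1.varchar_fld
    GROUP BY t1.integer_fld, t2.date_fld, UPPER(t2.varchar_fld)
    """
)

result = conn.execute(
    "SELECT cnt, max_date, min_name FROM test_install_pd ORDER BY min_name"
).fetchall()
print(result)
```

The two identical "alpha" rows match each other in the self-join, so that group counts four joined rows, while the single "beta" row counts one.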

Page 28

Questions?

Page 29

Thank you!