Greenplum Database Overview
Transcript of Greenplum Database Overview
1 © Copyright 2012 EMC Corporation. All rights reserved.
Greenplum Database Overview
Michael Crutcher Greenplum Product Management
Greenplum Unified Analytic Platform
GREENPLUM DATABASE
Industry Leading Database with
Massively Parallel Performance
To Empower your Analytics
Extreme Performance for Analytics
• Optimized for BI and analytics
– Deep integration with statistical packages
– High performance parallel implementations
• Simple and automatic
– Just load and query like any database
– Tables are automatically distributed across nodes
• Extremely scalable
– MPP shared-nothing architecture
– All nodes can scan and process in parallel
– Linear scalability by adding nodes
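The slide says tables are spread across nodes automatically but does not say how. As a minimal sketch, assuming rows are placed by hashing a distribution key (the function names and the CRC32 hash are illustrative choices, not Greenplum internals), the placement could look like this:

```python
import zlib
from collections import defaultdict

def segment_for(key, num_segments):
    """Map a distribution-key value to a segment index (illustrative hash only)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments

def distribute(rows, key, num_segments):
    """Place each row on exactly one segment according to its key value."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key], num_segments)].append(row)
    return segments

# Hypothetical table of 1000 rows spread over 4 segments.
rows = [{"customer_id": i, "amount": 10 * i} for i in range(1000)]
segments = distribute(rows, "customer_id", num_segments=4)
```

Because each row lands on exactly one segment, every node can scan its own share without coordinating with the others, which is the property the shared-nothing bullets rely on.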
Performance Through Parallelism
[Architecture diagram labels]
Master Servers: query planning & dispatch
Network Interconnect
Segment Servers: query processing & data storage
External Sources: loading, streaming, etc.
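The division of labor in the diagram (masters plan and dispatch, segments process and store) can be sketched as follows. This is an illustration under simple assumptions: threads stand in for segment servers, and `segment_scan` and `master_query` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def segment_scan(segment_rows, min_amount):
    """Work done on one segment: scan local rows only, return a partial result."""
    matching = [r["amount"] for r in segment_rows if r["amount"] >= min_amount]
    return sum(matching), len(matching)

def master_query(segments, min_amount):
    """Dispatch the same task to every segment, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(lambda seg: segment_scan(seg, min_amount), segments))
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total, count

# Three hypothetical segments, each holding its own slice of the table.
segments = [
    [{"amount": 5}, {"amount": 50}],
    [{"amount": 70}],
    [{"amount": 20}, {"amount": 90}, {"amount": 1}],
]
result = master_query(segments, min_amount=20)  # (sum, row count) over all segments
```

The master never touches the raw rows; it only merges small partial results, which is why adding segments adds scan capacity.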
Greenplum Delivers Choice & Flexibility
• Greenplum Data Computing Appliance
– Choose Greenplum Database and/or Hadoop modules in ¼ rack increments
– Scale up by adding your choice of additional modules
– Minimal time to value
• Greenplum Software Solutions
– Greenplum Database, Hadoop, & Chorus on your x86 hardware
– Flexibility for any workload or environment
– Perpetual or subscription licenses
Core Functionality
Component Overview
CLIENT ACCESS & TOOLS
• CLIENT ACCESS: ODBC, JDBC, OLEDB, MapReduce, etc.
• 3rd PARTY TOOLS: BI Tools, ETL Tools, Data Mining, etc.
• ADMIN TOOLS: Greenplum Command Center, Greenplum Package Manager
PRODUCT FEATURES
• LOADING & EXT. ACCESS: Petabyte-Scale Loading, Trickle Micro-Batching, Anywhere Data Access
• STORAGE & DATA ACCESS: Hybrid Storage & Execution (Row- & Column-Oriented), In-Database Compression, Multi-Level Partitioning, Indexes (B-tree, Bitmap, etc.), External Table Support
• LANGUAGE SUPPORT: Comprehensive SQL, Native MapReduce, SQL 2003 OLAP Extensions, Programmable Analytics, Analytics Extensions (GeoSpatial, PL/R, PL/Java, PL/Python, PL/Perl)
GREENPLUM DATABASE ADAPTIVE SERVICES
• Multi-Level Fault Tolerance (RAID, Mirroring, DR with Data Domain Boost)
• Online System Expansion
• Workload Management
CORE MPP ARCHITECTURE
• Shared-Nothing MPP
• Parallel Query Optimizer
• Polymorphic Data Storage™
• Parallel Dataflow Engine
• gNet™ Software Interconnect
• Scatter/Gather Streaming™ Data Loading
Most Powerful Data Loading Capabilities
• Industry-leading performance at 10+ TB per hour per rack
• Scatter-Gather Streaming™ provides true linear scaling
• Support for both large-batch and continuous real-time loading strategies
• Enable complex data transformations "in-flight"
• Transparent interfaces to loading via support files, applications, and services
Greenplum load rates scale linearly with the number of racks; others do not. For example, two racks = >20 TB/hour.
[Chart: SINGLE RACK COMPARISON of Greenplum, Oracle Exadata, Netezza, Teradata]
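The linear-scaling claim reduces to simple arithmetic. The sketch below takes the slide's 10 TB/hour per rack as a lower bound and assumes perfectly linear scaling; the function names and the 100 TB dataset are made up for illustration.

```python
def load_rate_tb_per_hour(racks, per_rack_tb_per_hour=10.0):
    """Aggregate load rate if throughput scales linearly with rack count."""
    return racks * per_rack_tb_per_hour

def hours_to_load(dataset_tb, racks, per_rack_tb_per_hour=10.0):
    """Time to load a dataset at the aggregate rate."""
    return dataset_tb / load_rate_tb_per_hour(racks, per_rack_tb_per_hour)

two_rack_rate = load_rate_tb_per_hour(2)      # matches the slide's two-rack example
hours_for_100_tb = hours_to_load(100.0, 2)    # hypothetical 100 TB dataset
```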
Polymorphic Table Storage™
• Storage types can be mixed within a table or database
– Four table types: heap, row-oriented AO, column-oriented AO, external
• Rich compression functionality, definable column by column
– Block compression: Gzip (levels 1-9), QuickLZ
– Stream compression: RLE (levels 1-4)
• Flexible indexing, partitioning, and more
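To make the two compression families concrete, here is a small sketch. It uses Python's `zlib` as a stand-in for Gzip block compression at levels 1 and 9, plus a plain run-length encoder for stream compression. The RLE here is a generic textbook version, not Greenplum's leveled RLE, and QuickLZ is omitted because it is not in the standard library.

```python
import zlib

def rle_encode(values):
    """Run-length encode a sequence into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    """Expand [value, run_length] pairs back into the original sequence."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

# A hypothetical low-cardinality column, the kind of data RLE suits well.
column = ["CA"] * 500 + ["NY"] * 300 + ["TX"] * 200
runs = rle_encode(column)

# Block compression of the same column at the lowest and highest levels.
raw = ",".join(column).encode("utf-8")
sizes = {level: len(zlib.compress(raw, level)) for level in (1, 9)}
```

Being able to set compression column by column matters because a sorted, repetitive column like this one collapses to three runs, while a high-entropy column may gain little from the same setting.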
[Figure: table 'CUSTOMER' partitioned by month, Mar '11 through Nov '11. Row-oriented for HOT DATA, column-oriented for COLD DATA.]
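The idea of mixing orientations within one partitioned table can be sketched as below. The slide does not say where the hot/cold boundary lies, so the three-month threshold, the reference date, and both function names are assumptions made purely for illustration.

```python
from datetime import date

def storage_for(partition_month, today, hot_months=3):
    """Pick an orientation by partition age in months (threshold is an assumption)."""
    age = (today.year - partition_month.year) * 12 + (today.month - partition_month.month)
    return "row" if age < hot_months else "column"

def to_columns(rows):
    """Pivot a list of row dicts into one list per column (column-oriented layout)."""
    return {name: [r[name] for r in rows] for name in rows[0]}

today = date(2011, 11, 15)                              # hypothetical "now"
partitions = [date(2011, m, 1) for m in range(3, 12)]   # Mar '11 through Nov '11
layout = {p.isoformat()[:7]: storage_for(p, today) for p in partitions}

cold_rows = [{"id": 1, "state": "CA"}, {"id": 2, "state": "NY"}]
cold_columns = to_columns(cold_rows)
```

Recent partitions stay row-oriented, which suits frequent single-row updates and lookups, while older partitions are pivoted so that scans over a few columns read less data.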
gNet Software Interconnect
A supercomputing-based "soft-switch" responsible for
– Efficiently pumping streams of data between motion nodes during query-plan execution
– Delivering messages, moving data, collecting results, and coordinating work among the segments in the system
Parallel Query Optimizer
• Cost-based optimization looks for the most efficient plan
• Physical plan contains scans, joins, sorts, aggregations, etc.
• Global planning avoids sub-optimal 'SQL pushing' to segments
• Directly inserts 'motion' nodes for inter-segment communication
PHYSICAL EXECUTION PLAN FROM SQL OR MAPREDUCE
[Figure: plan tree; node labels as extracted, top to bottom]
Gather Motion 4:1 (Slice 3)
Sort
HashAggregate
HashJoin
Redistribute Motion 4:4 (Slice 1)
HashJoin
Hash
Hash
HashJoin
Hash
Broadcast Motion 4:4 (Slice 2)
Seq Scan on motion
Seq Scan on customer
Seq Scan on lineitem
Seq Scan on orders
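The three motion types named in the plan can be illustrated with a toy model in which each segment is a Python list. This is only a sketch of the data movement pattern; the function names, the two-segment layout, and the sample tables are invented, and nothing here reflects the optimizer's real implementation.

```python
def redistribute(segments, key):
    """Redistribute Motion N:N: rehash every row by `key` so equal keys co-locate."""
    out = [[] for _ in segments]
    for seg in segments:
        for row in seg:
            out[hash(row[key]) % len(segments)].append(row)
    return out

def broadcast(segments):
    """Broadcast Motion N:N: every segment receives a full copy of all rows."""
    everything = [row for seg in segments for row in seg]
    return [list(everything) for _ in segments]

def gather(segments):
    """Gather Motion N:1: collect all rows at a single point (the master)."""
    return [row for seg in segments for row in seg]

def local_hash_join(left, right, key):
    """Join performed independently on one segment, using a hash table on `right`."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

# Two hypothetical segments; neither table is distributed on the join key.
orders = [[{"cust": 1, "total": 10}, {"cust": 2, "total": 5}], [{"cust": 1, "total": 7}]]
customers = [[{"cust": 2, "name": "B"}], [{"cust": 1, "name": "A"}]]

# Option 1: redistribute both sides on the join key, join locally, gather.
o, c = redistribute(orders, "cust"), redistribute(customers, "cust")
joined = gather([local_hash_join(a, b, "cust") for a, b in zip(o, c)])

# Option 2: leave orders in place and broadcast the small table instead.
via_broadcast = gather(
    [local_hash_join(a, b, "cust") for a, b in zip(orders, broadcast(customers))]
)
```

Both options produce the same rows. A cost-based planner would choose between them based on table sizes, since broadcasting copies the small table to every segment while redistribution moves each row once.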
Analytics Overview
Analytical Capabilities Overview
[Diagram labels: ODBC, JDBC; Data Access & Query Layer; SQL; SQL 2003 OLAP; MapReduce; Stored Procedures; In-Database Analytics; Polymorphic Storage; Greenplum gNet; GREENPLUM DATABASE; GREENPLUM HD]
In-Database Analytics: Categories
• Partner: SAS/HPA High Performance Analytics, SAS Scoring Accelerator
• Open-Source: Open Source Extensions
• User-written: User-Written Analytical Algorithms
• Embedded: GPDB Embedded Analytics
[Diagram labels: ODBC, JDBC; Data Access & Query Layer; SQL; In-Database Analytics; GREENPLUM DATABASE]
Analytics Highlight: MADlib
• Scalable in-database analytics
• Data-parallel
– Mathematical algorithms
– Statistical algorithms
– Machine learning algorithms
– Supports structured and unstructured data
• Open-source software
– Source accessibility
– Converge business, academic, and open-source communities
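"Data-parallel" here can be read as: each segment summarizes its own rows, and the small summaries are merged. The sketch below shows that pattern for mean and variance. It is not MADlib's API; `partial_stats`, `merge`, and `finalize` are hypothetical names, and the sum-of-squares formula is the simplest variant rather than the most numerically stable one.

```python
def partial_stats(values):
    """Per-segment step: reduce local values to (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def merge(a, b):
    """Combine two partial states; order does not matter."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def finalize(state):
    """Turn the merged state into (mean, population variance)."""
    n, s, ss = state
    mean = s / n
    return mean, ss / n - mean * mean

# Three hypothetical segments holding slices of one numeric column.
segments = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
state = (0, 0.0, 0.0)
for seg in segments:
    state = merge(state, partial_stats(seg))
mean, variance = finalize(state)
```

Only three numbers per segment cross the network, regardless of how many rows each segment holds, which is what lets this style of algorithm scale with the data.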
Manageability, Extensions
Easy Manageability for Big Data
Single console for both Database and Hadoop
• Administration
– Start, Stop Database
– Recover, Rebalance Segments
• Interactive view of System Metrics
– Real-time
– Historic (configurable by time period)
• In-depth view for System Health
– Hardware health
– Software (Database, Hadoop)
• Query Monitoring
– Search, Prioritize, Cancel Queries
– View Query's Execution Plan
• Workload Management
– Configure Resource Queues
– Prioritize Users
Easy Extension Installation: Greenplum Package Manager
Greenplum supports easy deployment of numerous extensions such as MADlib, PL/Perl, PL/Java, PostGIS, etc.
[Diagram labels: Master Servers, Segment Servers]
High Performance gNet for Hadoop: Parallel Query Access
• Connect any data set in Hadoop to GPDB's SQL engine
• Process Hadoop data in place
• Parallelize import/export of data from/to Hadoop thanks to GPDB's market-leading data sharing performance
• Supported formats:
– Text (compressed and uncompressed)
– Binary
– Proprietary/user-defined
• Hadoop distributions: GP HD 1.x, GP MR 1.x, CDH3u2
High Availability, Backup, Support
High Availability
• GPDB cluster
– 2 Master servers
– Multiple Segment servers
• Segment servers support multiple database instances
– Primary instances that actively process queries
– Standby mirror instances
• Block-level mirroring
– Low resource consumption
– Differential resync capable for fast recovery
[Diagram label: Set of Active Segment Instances]
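Why differential resync speeds up recovery can be shown with a toy primary/mirror pair. The class below is a sketch under simple assumptions: blocks are dictionary entries, and `SegmentPair` and its fields are invented names rather than anything from Greenplum.

```python
class SegmentPair:
    """A primary/mirror pair over a block-addressed store (illustrative only)."""

    def __init__(self):
        self.primary = {}
        self.mirror = {}
        self.mirror_up = True
        self.dirty = set()  # blocks changed while the mirror was unreachable

    def write(self, block_id, data):
        """Apply a write to the primary and, if reachable, to the mirror."""
        self.primary[block_id] = data
        if self.mirror_up:
            self.mirror[block_id] = data
        else:
            self.dirty.add(block_id)

    def resync(self):
        """Differential resync: copy only the blocks changed during the outage."""
        copied = len(self.dirty)
        for block_id in self.dirty:
            self.mirror[block_id] = self.primary[block_id]
        self.dirty.clear()
        self.mirror_up = True
        return copied

pair = SegmentPair()
for i in range(100):
    pair.write(i, f"v0-{i}")
pair.mirror_up = False      # simulate a mirror outage
pair.write(3, "v1-3")
pair.write(42, "v1-42")
copied = pair.resync()      # copies the 2 changed blocks, not all 100
```

A full resync would copy every block; tracking the changed set keeps recovery time proportional to the amount of change during the outage.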
Backup/Restore with EMC Data Domain
• Integration options
– NFS: Data Domain device mounted as NFS storage
– DD Boost: native, client-side deduplication. Supported in GPDB 4.2 and higher
• Drastic reduction in backup storage requirement
• Back up all segment servers in parallel directly to Data Domain
• Data Domain integrates seamlessly into standard Greenplum full backup data export and data restore procedures
[Diagram labels: Full Appliance + Data Domain; Boost or NFS; 2 x 10 Gbit IP]
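Client-side deduplication, in general terms, means the sender fingerprints data and transmits only what the target does not already hold. The sketch below shows that general idea with fixed-size chunks and SHA-256. It is not a description of how DD Boost works internally; the chunk size, class, and function names are all assumptions.

```python
import hashlib

class DedupStore:
    """Backup target that keeps one copy of each distinct chunk (illustrative)."""

    def __init__(self):
        self.chunks = {}

    def has(self, digest):
        return digest in self.chunks

    def put(self, digest, chunk):
        self.chunks[digest] = chunk

def backup(data, store, chunk_size=4096):
    """Return (recipe, bytes_sent); only chunks the store lacks are transferred."""
    recipe, sent = [], 0
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if not store.has(digest):
            store.put(digest, chunk)
            sent += len(chunk)
        recipe.append(digest)
    return recipe, sent

def restore(recipe, store):
    """Rebuild the original bytes from the ordered list of chunk digests."""
    return b"".join(store.chunks[d] for d in recipe)

store = DedupStore()
day1 = b"".join(bytes([i]) * 4096 for i in range(10))   # ten distinct chunks
day2 = day1[:4096] + b"\xff" * 4096 + day1[8192:]       # one chunk changed
recipe1, sent1 = backup(day1, store)
recipe2, sent2 = backup(day2, store)
```

The second backup sends a single chunk even though it describes a full image, which is the kind of saving the "drastic reduction in backup storage requirement" bullet points to.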
Backup/Restore with EMC Data Domain
Backup and restore between remote and primary sites
• Ideal for configurations with RPO and RTO requirements that can be specified in hours
• Supports:
– Collection replication for DD Boost backup
– Directory-level replication for NFS backup
– Encryption over the WAN
[Diagram: Data Domain Replication over LAN/WAN between two sites, each with a Greenplum DCA and a Data Domain system]
Customer Support Services
• Remote Technical Support
– 24x7 technical support and remote troubleshooting
– Customer-managed case severity level
– Four-hour response objective
• Onsite Support (DCA Only)
– Installation of replacement parts
– Replacement parts shipped for next business day arrival
– GP SW upgrade included
• Proactive Service
– Secure remote monitoring for hardware (DCA)
– Notification of engineering technical advisories
– Built-in tools maximize stability and performance
• Secure Self-Help
– 24x7 access to eService support tools including knowledgebase, forums, and appropriately licensed software updates
Other Relevant Greenplum Sessions
Session | Presenter | Times
Unified Analytics Platform Introduction | Brian Wilson | Tues 10:00-11:00, Thurs 1:00-2:00
Greenplum Hadoop Overview | Susheel Kaushik | Mon 10:00-11:00, Wed 4:15-5:15
Greenplum DCA Overview | Hanxi Chen | Mon 4:00-5:00, Thurs 10:00-11:00
Greenplum Analytics Workbench | Apurva Desai | Wed 8:30-9:30, Thurs 10:00-11:00
Analytics on Hadoop | Don Miner | Tues 11:30-12:30, Thurs 8:30-9:30
Big Data Driven Businesses in Action: Creating Real Business Value Using Greenplum UAP (Panel w/4 Customers) | Mike Maxey | Wed 4:15-5:15, Thurs 11:30-12:30
Analytics for Business Value: Collaboration | Josh Klahr | Mon 10:00-11:00, Wed 2:45-3:45
Disruptive Data Science: How Data Science and Big Data are Transforming Business, IT and People | Annika Jimenez, David Dietrich | Tues 4:15-5:15, Thurs 11:30-12:30
Thank You