High performance analytics sas greenplum sunz 2012

Date posted: 19-Oct-2014

Transcript of High performance analytics sas greenplum sunz 2012

Page 1: High performance analytics sas greenplum sunz 2012

© Copyright 2011 EMC Corporation. All rights reserved.

Data Computing Division

EMC ACQUIRES GREENPLUM

“Greenplum, with expertise in the massively parallel arena, will give the storage giant a boost in big-data computing.”

– InformationWeek –

Greenplum Becomes the Foundation of EMC’s Data Computing Division

“For three years, Gartner has identified Greenplum as the most advanced vendor in the visionary quadrant of its data warehouse DBMS Magic Quadrant….” – Gartner

Page 2: High performance analytics sas greenplum sunz 2012

New Realities… New Demands!

•  Do it faster
   –  Ingest more data
   –  Ingest it faster
   –  Keep it unsummarised, keep it for longer
•  Be more responsive
   –  Unpredictable queries, rapidly evolving bespoke analytics
   –  New tools: Hadoop, MapReduce, Hive, HBase, “R”
•  Manage new data types
   –  Manage and allow queries across structured, semi-structured and unstructured data
•  Do it at a lower cost

Big Data will revolutionise data warehousing and analytics.

Page 3: High performance analytics sas greenplum sunz 2012

Why Greenplum?

•  Fast Data Loading
•  Extreme Performance & Elastic Scalability
•  Unified Data Access

•  EMC Greenplum is a shared nothing, massively parallel processing (MPP) data warehouse system

•  Core principle of data computing is to move the processing dramatically closer to the data and to the people

Page 4: High performance analytics sas greenplum sunz 2012

[Figure: Greenplum architecture diagram. Labels:]

•  Master Server: query planning & dispatch (primary server, plus hot failover)
•  Segment Servers: query processing & data storage
•  Network Interconnect
•  Data Sources: loading, streaming, etc. External files, URLs, Hadoop (HDFS), web services (including from other DBs), O/S pipes (including from other DBs)
•  Standard business intelligence and analytical tools: SQL, BI tools, analytical tools
•  Hadoop MapReduce
•  Structured analytics and unstructured analytics

Callouts:

•  Clients see a single database
•  Queries are distributed across all available resources
•  Shared nothing, massively parallel processing means no bottlenecks and linear scalability
•  Data loading also takes advantage of the MPP architecture
•  Greenplum handles structured, semi-structured and unstructured data

Page 5: High performance analytics sas greenplum sunz 2012

Why is MPP different?

Greenplum is a Scale-Out Architecture on standard commodity hardware

MPP
•  Queries are shipped to each node simultaneously
•  Queries execute in parallel on each segment instance
•  Multiple pipelines of data
•  Highly scalable topology
•  Locks and buffers are not shared

Traditional
•  Single database buffer used by all user operations
•  More locks mean a more complex lock management system
•  Single pipe to data
•  Limited scalability

Page 6: High performance analytics sas greenplum sunz 2012

Data Computing Division, 20/02/12

Partitioning: The Key to Parallelism

Strategy: Spread data evenly across as many nodes (and disks) as possible

Table: Order

Order #   Order Date    Customer ID
43        Oct 20 2005   12
64        Oct 20 2005   111
45        Oct 20 2005   42
46        Oct 20 2005   64
77        Oct 20 2005   32
48        Oct 20 2005   12
50        Oct 20 2005   34
56        Oct 20 2005   213
63        Oct 20 2005   15
44        Oct 20 2005   102
53        Oct 20 2005   82
55        Oct 20 2005   55

[Figure label: Greenplum Database High Speed Loader]
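The "spread data evenly" strategy on this slide can be illustrated with a small Python sketch that assigns each Order row to a segment by hashing its order number. This is only an illustration under stated assumptions: the function names `segment_for` and `distribute` are hypothetical, four segments is an arbitrary choice, and CRC32 stands in for whatever hash function Greenplum actually uses.

```python
import zlib

# Rows from the slide's Order table: (order number, order date, customer ID).
ORDERS = [
    (43, "Oct 20 2005", 12), (64, "Oct 20 2005", 111), (45, "Oct 20 2005", 42),
    (46, "Oct 20 2005", 64), (77, "Oct 20 2005", 32), (48, "Oct 20 2005", 12),
    (50, "Oct 20 2005", 34), (56, "Oct 20 2005", 213), (63, "Oct 20 2005", 15),
    (44, "Oct 20 2005", 102), (53, "Oct 20 2005", 82), (55, "Oct 20 2005", 55),
]


def segment_for(key, num_segments):
    """Map a distribution key to a segment index.

    CRC32 is an illustrative stand-in, not Greenplum's hash function.
    """
    return zlib.crc32(str(key).encode("ascii")) % num_segments


def distribute(rows, num_segments, key_index=0):
    """Place every row on exactly one segment, chosen by hashing its key column."""
    segments = [[] for _ in range(num_segments)]
    for row in rows:
        segments[segment_for(row[key_index], num_segments)].append(row)
    return segments


if __name__ == "__main__":
    # Four segments is an assumption made for the example.
    for index, rows in enumerate(distribute(ORDERS, 4)):
        print(f"segment {index}: {len(rows)} rows -> {rows}")
```

Hashing a high-cardinality column such as the order number tends to give each segment a similar share of rows, which is what lets every node do a similar amount of work during a scan.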

Page 7: High performance analytics sas greenplum sunz 2012

Greenplum Database: Powerful Data Loading Capabilities

•  Industry-leading performance:
   –  >10TB per hour per rack
•  Innovative, parallel-everything architecture:
   –  Scatter-Gather Streaming™ provides true linear scaling
   –  Support for both large-batch and continuous real-time loading strategies
   –  Enable complex data transformations “in-flight”
   –  Transparent interfaces to loading via support files, application and services

Page 8: High performance analytics sas greenplum sunz 2012

Traditional Loading vs Greenplum DB Parallel Loading

[Figure: conventional loading compared with Greenplum parallel loading. Labels: ETL Servers, Interconnect, Segment nodes.]

Page 9: High performance analytics sas greenplum sunz 2012

Advanced pipeline process for fast operation

[Figure: the Client sends a sort request via the Master Server to the Segment Servers, which hold the unsorted values 1 4 2 9, 7 3 11 6 and 8 12 5 10.]

Page 10: High performance analytics sas greenplum sunz 2012

Advanced pipeline process for fast operation

[Figure: results flow from the Segment Servers through the Master Server to the Client, which receives the values in order: 12 11 10 9 8 7 6 5 4 3 2 1.]
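The sort pipeline on these two slides can be sketched in Python as a local sort on each segment followed by a streaming merge on the master. This is a sketch, not Greenplum's executor: the function names are hypothetical, the grouping of the twelve values into three segments is read off the diagram, and the segments are processed one after another here rather than in parallel.

```python
import heapq

# Unsorted values as they appear on the segment servers in the diagram.
SEGMENTS = [[1, 4, 2, 9], [7, 3, 11, 6], [8, 12, 5, 10]]


def segment_sort(rows, descending=False):
    """Work done independently on each segment: sort only the local rows."""
    return sorted(rows, reverse=descending)


def master_merge(sorted_runs, descending=False):
    """Work done on the master: merge already-sorted runs without re-sorting them."""
    return list(heapq.merge(*sorted_runs, reverse=descending))


def parallel_sort(segments, descending=False):
    """Sort per segment, then merge the runs into one ordered result."""
    runs = [segment_sort(rows, descending) for rows in segments]
    return master_merge(runs, descending)


if __name__ == "__main__":
    print(parallel_sort(SEGMENTS, descending=True))
    # prints [12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```

Because each run arrives already sorted, the master only compares the heads of the runs, so it can start returning rows to the client before any segment has finished sending.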

Page 11: High performance analytics sas greenplum sunz 2012

Greenplum Database: Extreme Performance

•  Optimized for BI and analytics
   –  Rich eco-system of partners
•  Provides automatic parallelization
   –  Just load and query like any database
   –  Tables are automatically distributed across nodes
   –  No need for manual partitioning or tuning
•  Extremely scalable MPP shared-nothing architecture
   –  All nodes can scan and process in parallel
   –  Linear scalability by adding nodes

[Figure labels: Interconnect, Loading]

Page 12: High performance analytics sas greenplum sunz 2012

Platform Independence Delivers Choice and Flexibility

Software-Only
•  On your x86 hardware
•  Flexibility for any workload

Virtualized Infrastructure
•  Pool resources
•  Elastic scalability

Data Computing Appliance
•  Optimized price/performance
•  Minimum time-to-value
•  Ideal for production environments

Page 13: High performance analytics sas greenplum sunz 2012

Greenplum Polymorphic Data Storage

[Figure: table ‘Customer’ partitioned by month (Jan ’09 to Nov ’09), with partitions labelled Column-Oriented Archival Compression, Column-Oriented Fast Compression and Row-Oriented Fast Compression.]

•  Greenplum Database’s engine provides a flexible storage model
   –  Four table types: heap, row-oriented, column-oriented, external
   –  Block compression: Gzip (levels 1-9), QuickLZ
•  Storage types can be mixed within a database, and even within a table
   –  Fully configurable via table DDL and partitioning syntax
   –  You may also choose to index some partitions and not others
•  Gives customers the choice of processing model for any table or partition
   –  Supports ILM scenarios: denser packing of older partitions, etc.
   –  Tables/partitions of different storage types can be joined together without restriction
   –  Highly tuned, e.g. columnar does efficient pre-projection and parallel execution

Page 14: High performance analytics sas greenplum sunz 2012

Unified Data Access Across The Enterprise

•  Workload Management
   –  Connection management controls how many users can be connected and assigns them to a queue
   –  User-based resource queues allow for control of the total number or cost of queries allowed at any point in time
•  Dynamic Query Prioritization
   –  Patent-pending technique of dynamically balancing resources across running queries
   –  Allows DBAs to control query priorities in real time, or determine default priorities by resource queue
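The idea of a resource queue that caps the number or total cost of running queries can be sketched as a small admission controller in Python. Everything here is hypothetical and for illustration only: the class `ResourceQueue`, its `active_limit` and `cost_limit` parameters and the first-in-first-out admission policy are assumptions, not Greenplum's implementation or configuration names.

```python
from collections import deque


class ResourceQueue:
    """Hypothetical admission controller; not Greenplum's implementation."""

    def __init__(self, active_limit, cost_limit=float("inf")):
        self.active_limit = active_limit  # max queries running at once
        self.cost_limit = cost_limit      # max summed cost of running queries
        self.running = {}                 # query_id -> cost
        self.waiting = deque()            # (query_id, cost) in arrival order

    def _fits(self, cost):
        return (len(self.running) < self.active_limit
                and sum(self.running.values()) + cost <= self.cost_limit)

    def submit(self, query_id, cost=0.0):
        """Return True if the query starts now, False if it has to wait."""
        if not self.waiting and self._fits(cost):
            self.running[query_id] = cost
            return True
        self.waiting.append((query_id, cost))
        return False

    def finish(self, query_id):
        """Release a finished query and start waiting queries in arrival order."""
        del self.running[query_id]
        started = []
        # Note: a query whose cost alone exceeds cost_limit would wait forever here.
        while self.waiting and self._fits(self.waiting[0][1]):
            waiting_id, cost = self.waiting.popleft()
            self.running[waiting_id] = cost
            started.append(waiting_id)
        return started


if __name__ == "__main__":
    queue = ResourceQueue(active_limit=2)
    print(queue.submit("q1"), queue.submit("q2"), queue.submit("q3"))
    print(queue.finish("q1"))
```

In the usage example the third query waits until the first one finishes, which is the behaviour the slide describes as controlling how many queries are allowed at any point in time.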

Page 15: High performance analytics sas greenplum sunz 2012

Highly interactive web-based performance monitoring

Real-time and historic views of:

•  Resource utilization

•  Queries and query internals

Greenplum Performance Monitor

Page 16: High performance analytics sas greenplum sunz 2012

Key Technical Requirements for HPA

•  Technical Values
   –  Performance: massively parallel architecture
   –  Load speeds: 10TB/hr
   –  Integration with SAS
   –  In-database analytics using Java, PL/R, etc.
   –  Integration with many more BI and analytical tools
   –  Integration with Hadoop for unstructured data analysis
•  Financial Value
   –  Lower total cost of ownership
   –  Best price/performance ratio in the industry for EDW/analytical appliance
•  Operational Values
   –  No index maintenance
   –  Backup recovery solution
   –  Most robust disaster recovery solution in the industry
   –  Best technical and customer support organization backing

Page 17: High performance analytics sas greenplum sunz 2012

A Few SAS Generalisations

•  Large sequential reads and writes
•  Reading and writing of data is done via the OS’s file cache
•  I/O throughput rate is restricted by how fast the OS’s file cache can process the data
•  A lot of temporary files can be created

Page 18: High performance analytics sas greenplum sunz 2012

An MPP SQL query – just for fun

• Take a 44TB table on which the query planner executes a sequential scan. There are 1,218 million rows of data and 1000 columns, and 5 concurrent users run the same query on a monthly data set.

• As a baseline: a single node on a typical high-end server with a single controller can read about 1.5GB per second into the database. At that rate one scan of our 44TB takes about 8.1 hours (44,000GB at 1.5GB per second), so a DBMS deployed on a single node needs 40.7 hours to serve all 5 concurrent scans.

Page 19: High performance analytics sas greenplum sunz 2012

An MPP SQL query – just for fun

•  If we deploy over 8 nodes on a Greenplum cluster the aggregate I/O bandwidth increases linearly to 12GB/sec. Our query will complete in 61 minutes.

•  If we compress the rows then we can read more data with each I/O. Compression varies but 2.5X is a reasonable estimate. So our effective scan rate improves by 2.5 and our query completes in 24.4 minutes.

Page 20: High performance analytics sas greenplum sunz 2012

An MPP SQL query – just for fun

• Partitioning allows us to split the data on each segment by a known value (by month in our example) and, where possible, read only the partitions selected. We scan only 1/84th (7 x 12 months) of the table. Our query completes in 17.4 seconds.

• Columnar compression is more effective than row-based compression. 10X columnar compression is a conservative estimate, and 10X is 4 times better than the 2.5X row compression already built into our example. So now our table scan completes in 4.35 seconds.

Page 21: High performance analytics sas greenplum sunz 2012

An MPP SQL query – just for fun

• Columnar projection lets us perform I/O on only the columns we are interested in. Let's assume 500 of the 1000 columns in our example. By reading only 50% of the data we reduce our I/O by 50%, and our table scan completes in 2.175 seconds. If 5 people were executing the same query concurrently and each person was configured to have an equal share of the system resources, then each person's query would complete in 10.9 seconds.

Page 22: High performance analytics sas greenplum sunz 2012

An MPP SQL query – just for fun

• Note that queries that touch two months touch twice as much data and would complete in 4.35 seconds, four months in 8.7 seconds, and so on: the approach is scalable and robust.

• Also note that joins are implemented using a shared-nothing approach, meaning that they scale up as well.

• We can apply indexes if necessary to further improve query performance.
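The chain of estimates in this worked example can be reproduced with a short Python calculation. It is a sketch under stated assumptions: the function `scan_seconds` and its parameter names are made up for illustration, 1TB is taken as 1000GB (which is what the slide figures imply), and concurrent users are modelled as simply multiplying the elapsed time.

```python
TABLE_GB = 44 * 1000        # 44TB table, assuming 1TB = 1000GB
NODE_GB_PER_SEC = 1.5       # read rate of one node, from the baseline slide


def scan_seconds(nodes=1, compression=1.0, partition_fraction=1.0,
                 column_fraction=1.0, concurrent_users=1):
    """Estimated elapsed seconds for a sequential scan under the slide's assumptions."""
    gb_to_read = TABLE_GB * partition_fraction * column_fraction / compression
    bandwidth = NODE_GB_PER_SEC * nodes  # aggregate I/O scales linearly with nodes
    return gb_to_read / bandwidth * concurrent_users


if __name__ == "__main__":
    month = 1 / 84  # one month out of 7 x 12 monthly partitions
    steps = [
        ("1 node, 1 query", scan_seconds()),
        ("1 node, 5 concurrent queries", scan_seconds(concurrent_users=5)),
        ("8 nodes", scan_seconds(nodes=8)),
        ("+ 2.5X row compression", scan_seconds(nodes=8, compression=2.5)),
        ("+ one monthly partition", scan_seconds(nodes=8, compression=2.5,
                                                 partition_fraction=month)),
        ("+ 10X columnar compression", scan_seconds(nodes=8, compression=10,
                                                    partition_fraction=month)),
        ("+ 500 of 1000 columns", scan_seconds(nodes=8, compression=10,
                                               partition_fraction=month,
                                               column_fraction=0.5)),
        ("+ 5 concurrent users", scan_seconds(nodes=8, compression=10,
                                              partition_fraction=month,
                                              column_fraction=0.5,
                                              concurrent_users=5)),
    ]
    for label, seconds in steps:
        print(f"{label:32s} {seconds:12.2f} s")
```

The results agree with the slide figures to within rounding; the slides round at each step (17.4 s, then 4.35 s, 2.175 s and 10.9 s), whereas the sketch carries full precision through the chain.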

Page 23: High performance analytics sas greenplum sunz 2012

An MPP SQL query – Summary

Page 24: High performance analytics sas greenplum sunz 2012

Multiple options for SAS & GP Deployments

•  SAS Access, Greenplum database
•  SAS Grid
•  SAS In-Database
•  SAS In-Memory

Page 25: High performance analytics sas greenplum sunz 2012

SAS Access, Greenplum database

•  Provides integration capability to Greenplum

•  Allows for increased performance of Base SAS Procs when using the latest SAS v 9.3 release

•  Products: SAS Access for Greenplum

•  libname myGP ODBC server=gplum04 db=customers port=5432 user=gpusr1 password=gppwd1;

Page 26: High performance analytics sas greenplum sunz 2012

SAS In-Database: In-Database Scoring

•  Enables SAS Enterprise Miner models to execute within the Greenplum database.

•  Automatically translates and publishes the model as a scoring function inside the database.

•  High-performance model scoring with faster time to results

•  Products: SAS Scoring Accelerator

Note: For Greenplum, this will only be available in the next version release of 9.3, slated for the end of this year.

Page 27: High performance analytics sas greenplum sunz 2012

SAS In-Database: In-Database Analytics

•  Execution of key SAS analytical, data discovery and data summarization tasks in database.

•  Reduces the time needed to build, execute and deploy powerful predictive models.

•  Improve data governance on predictive analytics projects and produce faster, better results.

•  Products: SAS Analytics Accelerator

Note: This is currently on the roadmap for Greenplum and will be available with future SAS versions.

Page 28: High performance analytics sas greenplum sunz 2012

SAS Grid

•  SAS running on a cluster of servers for better performance

•  This can provide some acceleration on the base procs with Greenplum as the database, as it allows the database to make use of parallel processing

•  Products: SAS Access for Greenplum

Page 29: High performance analytics sas greenplum sunz 2012

SAS In-Memory

•  This is a complete 'big data' stack offering fast-loading, robust data management and complex analytics in a purpose-built environment.

•  Very high performance for business users that can significantly increase revenues or decrease costs as a result of improved performance

•  Products: GP & SAS HPA

Note: Available in Q4 2011

Page 30: High performance analytics sas greenplum sunz 2012

SAS / Greenplum Product Overview

SAS High Performance Computing

SAS Access
•  Provides integration capability to a number of databases
•  Allows for increased performance of Base SAS Procs when using the latest SAS v 9.3 release
•  Products: SAS Access for Greenplum

SAS Grid
•  Utilized to run SAS on a grid of commodity servers instead of a large UNIX server or mainframe
•  Limited impact on SAS jobs and users, but simplified operations. Generally uses more CPUs for improved performance
•  Products: SAS Access for Greenplum, SAS Grid

SAS In-Database
•  Allows certain models to be pushed into the database for execution. Requires SAS Enterprise Miner in order to be utilized
•  Will lead to significant (20x or more) improvement in performance versus non-database deployments
•  Products: SAS Access for Greenplum, SAS Grid, SAS Enterprise Miner, SAS Scoring Accelerator for Greenplum

SAS In-Memory (HPA)
•  New functionality from SAS that requires a dedicated database appliance
•  Very high performance for business users that can significantly increase revenues or decrease costs as a result of improved performance
•  Products: SAS Access for Greenplum, SAS Grid, SAS High Performance Analytics

Page 31: High performance analytics sas greenplum sunz 2012

In-Database Roadmap for Greenplum

Each entry lists the Greenplum SAS product, its capability, and its status.

•  Base SAS®: Descriptive Statistics / Query and Reporting, SQL Pushdown. Status: Available in 2011 Q4 (9.3 M)
•  SAS/Access® Interface: Database Specific Integration and Connectivity. Status: Available
•  Support for SAS Format Function. Status: Available in 2011 Q4 (9.3 M)
•  SAS® Data Integration Studio: Data Extraction, Load and Transformation. Status: Available
•  SAS® Scoring Accelerator*: Production Batch Scoring / Real Time Scoring. Status: Available in 2011 Q4 (9.3 M)

Page 32: High performance analytics sas greenplum sunz 2012

What is SAS High Performance Analytics for GP?

•  It’s software (GP DB, SAS HPA)
•  It combines parallel execution with in-memory processing
•  It allows large volumes of data to be handled quickly
•  It covers a select set of procedures from the following SAS products: Base SAS, SAS/STAT, SAS/ETS, SAS/OR and SAS Enterprise Miner

Page 33: High performance analytics sas greenplum sunz 2012

Why are GP & SAS a good match?

•  Greenplum & SAS already work well together via SAS/Access and the Scoring Accelerator
•  GP & SAS represent an end-to-end analytics infrastructure, including rapid data loads, powerful ETL, and parallel data computing for reports and analytics
•  Greenplum delivers extreme performance via the MPP architecture that is optimized for faster query execution and unmatched data loading
•  Rapidly deployable and designed for massive growth
•  SAS & GP are working to develop advanced solutions with deeper connectivity. This solution will represent the state of the art in high-performance, scalable, advanced analytics

Page 34: High performance analytics sas greenplum sunz 2012

Some Greenplum Big Data References • 

•  The Greenplum Database supports up to 2^48 (2 to the power of 48) rows per table. One Greenplum customer – Fox Interactive Media has a trillion row fact table and is adding a further 3TB per day in a True mixed-workload environment supporting production reporting, ad-hoc data mining, and operational data services.

• 

•  Another On-line eCommerce client at last site visit had approximately 21TB in their Greenplum instance with 10 nodes. They load between 10-30 million rows a day but the issue is frequency and complexity rather than size. There are 2,000 Informatica workflows per day, complex hourly loads (up to 300 Greenplum loads per batch with 9,000 Greenplum loads every day)

• 

•  They have 5,000 tables, 350,000 columns 4,000 views, 1,600 indexes, relational and dimensional models, heavily relational/3NF as they had a legacy Teradata DW that Greenplum replaced. Hourly metadata/schema/table changes in response to the hourly data loads.

•  This Client is averaging around a million SQL statements per day. They have heavy spikes during peak hours and maintain a Cognos reporting SLA of 100k queries per hour. They have over 1000 Cognos users and 50% of the workload is Cognos; these are mostly small statements. 25% is financial reporting, 10% is CRM. The remaining 15% is ad-hoc by power users and analysts with lots of 25-50 slice significantly large queries (and up to 100 slices). They have dependent views to 4 levels of nesting: view (great-grandchild) -> view (grandchild) -> view (child) -> view -> table.

Page 35: High performance analytics sas greenplum sunz 2012

Australian Tax Office uses Greenplum as an investigatory tool in their Compliance and Audit Logging Unit. They are an extremely happy reference customer, citing Greenplum's ability to pull in data from multiple sources and quickly analyse the data without needing to create complex data models or even indices.

Some SAS & Greenplum Customers

RWS, in Singapore, used MS SQL Server as their reporting environment. Their reporting and ETL processes were very slow and the DWH environment was limited in terms of scalability. They were looking for an in-database platform that can work with SAS. We won a competitive PoC last quarter and the solution is currently being implemented. They will be using GP & SAS as an EDW to store and analyze customer trends.

AIS, a telco in Thailand, migrated a Teradata DWH as well as 2 Oracle DWHs onto a single Greenplum cluster, demonstrating the schema independence of the database. The system has expanded to 70 TB across 32 servers. AIS uses SAS as their analytical platform.

Inland Revenue Service was running on an Oracle DWH and had problems with analytical report processing time. We won this deal in Q3 and it is currently in the implementation phase.

Samsung Life Insurance had a 50TB Sybase DWH that they had spent 8 years building. They ran out of performance but were able to migrate the entire environment to Greenplum in 3 months. They had approx. 400,000 reports across 4 tools (SAS, Webfocus, MSTR, OLAP); only about 100 required tuning.

Page 36: High performance analytics sas greenplum sunz 2012

Greenplum Customers -- Government

•  Pacific Northwest National Labs (Dept. of Energy) does cyberanalytics.

•  USAspending.gov traces the outlays of the US Federal Government.

•  The Federal Reserve Bank of Kansas City does economic analysis mostly related to the housing market.

•  Recently, the Internal Revenue Service purchased a DCA to do work related to fraudulent tax returns.

•  ATO uses GP as an investigatory tool in their Compliance and Audit Logging Unit.

Page 37: High performance analytics sas greenplum sunz 2012

SAS AND EMC GREENPLUM INTEGRATED ARCHITECTURE

[Figure: layered architecture diagram. Labels:]

•  Data Science Team: Data Scientist, Data Engineer, Data Analyst, BI Analyst, LOB User, Data Platform Admin
•  Greenplum Chorus: Analytic Productivity Layer
•  SAS Analytics, SAS Business Intelligence, SAS Information Management
•  Data Access & Query Layer
•  Greenplum Database, Greenplum Hadoop
•  Private/Hybrid Cloud Infrastructure or Appliance

Page 38: High performance analytics sas greenplum sunz 2012

High Performance Analytics

‘The power to know fast’

Page 39: High performance analytics sas greenplum sunz 2012

Questions?