AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matillion ETL for Amazon...

Page 1: AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matillion ETL for Amazon Redshift (BDA203)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Ryan Oattes, Enterprise Architect, GE

Adam Gantt, Solution Architect, Matillion

November 29, 2016

BDA203

How GE Transformed Billions of Rows in Record Time Using Matillion ETL for Amazon Redshift

Page 2

Speakers

Ryan Oattes, Enterprise Architect, GE Digital

Adam Gantt, Solution Architect, Matillion

Page 3

What to Expect from the Session

• Understand GE’s challenge and use case

• Explore the technical architecture: Amazon Redshift and Matillion ETL

• ELT approach

• The role of AWS Marketplace

• Lessons learned and tips

• Deep dive technical demo of Matillion ETL for Amazon Redshift

• Share experiences, benefits and lessons learned

• Technical Q&A

Page 4

Background and Requirement

Page 5

GE, our company and requirement

• General Electric has been in business for over 140 years, investing $5.4B annually in R&D (6% of revenue)

• Augmenting our operational technology depth with digital

• Focus on industrial Internet of Things, and creating insights based on business and machine data

• Knew we needed best-in-class partners to let us focus on what we do best; GE migrating 9000 workloads into AWS

Page 6

Our Challenge

• Raise the bar on data warehouse scalability, integration, stability, and development velocity

• Needed scalability for machine and business data, as GE increasingly digitizes

• Self-serve BI strategy meant we had to maintain our current compute capabilities

• Increasingly critical dependencies require rock-solid platform

• Desired more intuitive and accessible analytics solution

Page 7

Technical Journey

• Had an on-premises, in-memory columnar database

• Good, but hard to scale

• We’re getting out of the business of managing infrastructure

• Selected Amazon Redshift to replace the on-premises solution

• Fully managed & scalable

• Familiar SQL technology

• Our ETL technology previously was a traditional enterprise ETL platform

• Tried using it for our Amazon Redshift project, but it wasn’t working for us

• Lots of manual SQL coding required

• Deployment, management, scaling, etc. were hard, as it wasn’t “Cloud 1st”

• Wanted a “Cloud 1st” solution, ideally built for Amazon Redshift

• An AWS SA recommended we look at Matillion

• AWS Marketplace allowed us to PoC Matillion quickly and cost effectively

Page 8

Solution Architecture

Source data: SAP

Data Warehouse: Amazon Redshift

Data Integration: Matillion ETL for Amazon Redshift; HVR

Data Visualisation: Tableau

[Architecture diagram labels: SAP; CDC Data Replication (HVR); Amazon Redshift Cluster (32 x DC1 Nodes) with Staging and DWH; Matillion ETL (M3.Large), ELT; Tableau]

Page 9

ELT Approach

Amazon Redshift’s MPP columnar architecture is very fast at transforming data

Matillion ETL uses a push-down architecture, transforming (join/aggregate/filter/calculate etc.) data on the Amazon Redshift cluster directly

This simplifies our infrastructure and scaling, and the speed and proximity to the data helps developer productivity

Same architecture can be achieved manually (coding), but not as productively as with a tool
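The push-down idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions, not Matillion's implementation: Python's built-in sqlite3 module stands in for Amazon Redshift, and the table and column names (`stg_orders`, `dw_sales_by_region`) are hypothetical. The transform is a single SQL statement executed inside the database, so no rows travel through the client.

```python
import sqlite3


def run_pushdown_transform(conn: sqlite3.Connection) -> None:
    """Filter, calculate and aggregate inside the database (the 'T' in ELT)."""
    conn.execute("DROP TABLE IF EXISTS dw_sales_by_region")
    conn.execute(
        """
        CREATE TABLE dw_sales_by_region AS
        SELECT region,
               SUM(quantity * unit_price) AS revenue,   -- calculate + aggregate
               COUNT(*)                   AS order_count
        FROM stg_orders
        WHERE status = 'SHIPPED'                        -- filter
        GROUP BY region
        """
    )


# sqlite3 is a stand-in for the warehouse; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stg_orders "
    "(region TEXT, quantity INTEGER, unit_price REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO stg_orders VALUES (?, ?, ?, ?)",
    [
        ("EMEA", 2, 10.0, "SHIPPED"),
        ("EMEA", 1, 5.0, "SHIPPED"),
        ("APAC", 3, 4.0, "SHIPPED"),
        ("APAC", 9, 9.0, "CANCELLED"),
    ],
)
run_pushdown_transform(conn)
rows = conn.execute(
    "SELECT region, revenue, order_count FROM dw_sales_by_region ORDER BY region"
).fetchall()
print(rows)
```

The client only issues statements and reads back the small result; the join, filter and aggregation work stays where the data lives.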

Page 10

Transform goals

Solution Details – the Transforms

[Figure: One of our transformation jobs]

• Denormalise complex underlying SAP data structures → analysis-ready Facts and Dimensions

• “Clean up” data, ironing out, for instance, differences between configurations in business units/geographies

• Add metrics, KPIs and measures for business to consume (and do so consistently)

Page 11

Transform goals (cont.)

Solution Details – the Transforms

[Figure: Transform detail]

• Use of ETL tool required to manage the complexity of the transforms

• Graphical jobs help document, understand and share the business logic

• Full range of transforms used, e.g., join, aggregate, filter, calculate, rename, convert type, map values, rank, etc.

• ELT architecture means you can see data live in the job at each component, significantly streamlining development

Page 12

Solution Details – Volumes & KPIs

• Approx. 4 TB across 32 x Amazon Redshift compute-intensive nodes

• Transactional business data from SAP

• Financial transactions, inventory movements, sales orders, etc.

• Data is staged to Amazon Redshift via CDC (HVR)

• Then micro-batched into the warehouse: transformed, then upserted using the ‘Table Update’ component (set to update/insert strategy) in Matillion ETL

• Under the hood, this does an insert for any new rows and an update for any existing rows

• Runs every hour using a schedule in the Matillion scheduler
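The update/insert strategy described above can be sketched as two statements inside one transaction. This is a minimal illustration, not the SQL that the ‘Table Update’ component actually generates: sqlite3 stands in for Amazon Redshift, and the tables `stg_orders` and `dw_orders` are hypothetical.

```python
import sqlite3


def upsert_from_staging(conn: sqlite3.Connection) -> None:
    """Update rows that already exist in the target, then insert the new ones."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            """
            UPDATE dw_orders
            SET amount = (SELECT s.amount FROM stg_orders s
                          WHERE s.order_id = dw_orders.order_id)
            WHERE order_id IN (SELECT order_id FROM stg_orders)
            """
        )
        conn.execute(
            """
            INSERT INTO dw_orders (order_id, amount)
            SELECT s.order_id, s.amount
            FROM stg_orders s
            WHERE NOT EXISTS (SELECT 1 FROM dw_orders d
                              WHERE d.order_id = s.order_id)
            """
        )


# Hypothetical micro-batch: order 2 changed, order 3 is new.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dw_orders (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE stg_orders (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO dw_orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
conn.executemany("INSERT INTO stg_orders VALUES (?, ?)", [(2, 25.0), (3, 30.0)])
upsert_from_staging(conn)
result = conn.execute(
    "SELECT order_id, amount FROM dw_orders ORDER BY order_id"
).fetchall()
print(result)
```

Running the update before the insert means freshly inserted rows are not touched a second time within the same batch.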

Page 13

Deep Dive:

Hands On Demonstration

Page 14

Lessons learned and tips

Some of the jobs require very large, complex joins across billions of rows

• In a Matillion job, each component builds an underlying Amazon Redshift view. Amazon Redshift optimises the collection of views into a performant execution plan

• To improve performance on the largest joins, we split the jobs up into sub-jobs, outputting to a table between each sub-job

• Distribution and sort keys of these intermediary tables were set to optimise the subsequent join (using the Table Output component in Matillion)
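As a rough sketch of the intermediate-table tip, the helper below builds an Amazon Redshift `CREATE TABLE ... AS` statement with explicit distribution and sort keys. The helper name, table names and columns are all hypothetical, and the statement is only printed, not run; in the talk this step was done with the Table Output component rather than hand-written SQL.

```python
from typing import Sequence


def build_intermediate_table_sql(
    table: str, select_sql: str, dist_key: str, sort_keys: Sequence[str]
) -> str:
    """Build a Redshift CTAS statement whose keys suit the next join."""
    sort_cols = ", ".join(sort_keys)
    return (
        f"CREATE TABLE {table}\n"
        f"DISTKEY({dist_key})\n"
        f"SORTKEY({sort_cols})\n"
        f"AS\n{select_sql};"
    )


# Hypothetical first sub-job: materialise order lines, keyed on the column
# that the following sub-job will join on (customer_id).
step1_sql = build_intermediate_table_sql(
    table="tmp_order_lines",
    select_sql=(
        "SELECT h.order_id, h.customer_id, l.material_id, l.quantity\n"
        "FROM stg_order_header h\n"
        "JOIN stg_order_line l ON l.order_id = h.order_id"
    ),
    dist_key="customer_id",
    sort_keys=["customer_id"],
)
print(step1_sql)
```

Rows that share a distribution key value are stored on the same slice, so when both sides of the next join are distributed on the join column, Redshift can join them without redistributing data across nodes.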

Page 15

Lessons learned and tips

We record the time of the last successful upload in a state table in Amazon Redshift

• The last time of a transform is read from the table

• Data is processed from after that time only, then the table is updated

• This is wrapped in transaction control, using the Matillion transaction control orchestration components
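The state-table pattern above can be sketched as follows. This is an illustration under assumptions, not the job shown in the talk: sqlite3 stands in for Amazon Redshift, the tables (`etl_state`, `stg_events`, `dw_events`) and the job name are hypothetical, and timestamps are ISO 8601 strings so that string comparison matches time order.

```python
import sqlite3


def run_incremental_load(conn: sqlite3.Connection, job_name: str, run_time: str) -> None:
    """Process only rows changed since the last successful run, then move the marker."""
    with conn:  # one transaction: the load and the marker update succeed or fail together
        (last_success,) = conn.execute(
            "SELECT last_success FROM etl_state WHERE job_name = ?", (job_name,)
        ).fetchone()
        conn.execute(
            "INSERT INTO dw_events (event_id, changed_at) "
            "SELECT event_id, changed_at FROM stg_events "
            "WHERE changed_at > ? AND changed_at <= ?",
            (last_success, run_time),
        )
        conn.execute(
            "UPDATE etl_state SET last_success = ? WHERE job_name = ?",
            (run_time, job_name),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE etl_state (job_name TEXT PRIMARY KEY, last_success TEXT)")
conn.execute("CREATE TABLE stg_events (event_id INTEGER, changed_at TEXT)")
conn.execute("CREATE TABLE dw_events (event_id INTEGER, changed_at TEXT)")
conn.execute("INSERT INTO etl_state VALUES ('load_events', '2016-11-29T09:00:00')")
conn.executemany(
    "INSERT INTO stg_events VALUES (?, ?)",
    [
        (1, "2016-11-29T08:30:00"),  # before the marker: already processed
        (2, "2016-11-29T09:15:00"),
        (3, "2016-11-29T09:45:00"),
    ],
)
run_incremental_load(conn, "load_events", "2016-11-29T10:00:00")
run_incremental_load(conn, "load_events", "2016-11-29T11:00:00")  # nothing new to load
loaded = conn.execute("SELECT event_id FROM dw_events ORDER BY event_id").fetchall()
marker = conn.execute("SELECT last_success FROM etl_state").fetchone()[0]
print(loaded, marker)
```

Because the marker only advances inside the same transaction as the load, a failed run leaves it untouched and the next run picks up the same window again.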

SAP uses four-character names for fields and tables

• We use the rename column component in Matillion, in “Text Mode”, to allow us to copy and paste a spreadsheet of mappings of the 4-char names to human-readable names easily

• A useful hack which saved hours
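The same idea can be sketched outside the tool. The slides do not show the exact text that “Text Mode” accepts, so this example assumes tab-separated `source<TAB>target` lines, which is what a two-column spreadsheet range produces when copied. The field codes (`ABCD`, `EFGH`), the table name and both helper functions are hypothetical.

```python
def parse_rename_mappings(pasted: str) -> dict:
    """Turn 'source<TAB>target' lines copied from a spreadsheet into a dict."""
    mappings = {}
    for line in pasted.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Raises ValueError if a line has no tab, which surfaces bad input early.
        source, target = (part.strip() for part in line.split("\t", 1))
        mappings[source] = target
    return mappings


def build_rename_select(table: str, mappings: dict) -> str:
    """Build a SELECT that exposes each short code under its readable name."""
    cols = ",\n       ".join(f'{src} AS "{dst}"' for src, dst in mappings.items())
    return f"SELECT {cols}\nFROM {table}"


# Hypothetical two-column range pasted from a spreadsheet.
pasted = "ABCD\tcustomer_name\nEFGH\torder_date\n"
mappings = parse_rename_mappings(pasted)
rename_sql = build_rename_select("stg_sap_table", mappings)
print(rename_sql)
```

Keeping the mapping in a spreadsheet means the business-facing names can be reviewed and corrected by people who never open the job itself.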

Page 16

Lessons learned - AWS Marketplace

AWS Marketplace genuinely added value in both PoC/procurement and also architecturally

• Took away any security issues as it’s an AMI running in our own VPC

• Same goes for performance – no data moving across the internet

• Allowed us to try out several tools quickly and pick the best fit

• Supported GE’s “fail fast” ethos

• Made purchasing and therefore project launch faster and simpler – no small thing in a company the size of GE

Page 17

Outcomes

Page 18

Key Numbers

45% operating cost reduction

6 months from ideation to go-live

<$100 PoC pilot cost

Page 19

Outcomes for GE

• Preserved performance and added stability

• Simplified operations with managed Amazon Redshift solution

• OPEX savings. No CAPEX required

• Fast development from concept to reality

• Highly resilient, seamless scaling

Page 20

Q & A

Page 21

Resources

• Free 14-day trial:

• https://aws.amazon.com/marketplace/pp/B010ED5YF8

• Tutorials:

• https://www.youtube.com/user/MatillionVideos

• Support And Documentation:

• https://redshiftsupport.matillion.com

• Web:

• https://www.matillion.com

• Visit booth 2338 for a hands-on demo and AWS credits to support your trial/PoC

Page 22

Thank you!

Page 23

Remember to complete your evaluations!