Transcript of AWS re:Invent 2016: Billions of Rows Transformed in Record Time Using Matillion ETL for Amazon...
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ryan Oattes, Enterprise Architect, GE
Adam Gantt, Solution Architect, Matillion
November 29, 2016
BDA203
How GE Transformed
Billions of Rows in Record Time
Using Matillion ETL for Amazon Redshift
Speakers
Ryan Oattes,
Enterprise Architect, GE Digital
Adam Gantt,
Solution Architect, Matillion
What to Expect from the Session
• Understand GE’s challenge and use case
• Explore the technical architecture: Amazon Redshift and Matillion ETL
• ELT approach
• The role of AWS Marketplace
• Lessons learned and tips
• Deep dive technical demo of Matillion ETL for Amazon Redshift
• Share experiences, benefits and lessons learned
• Technical Q&A
Background and Requirement
GE, our company and requirement
• General Electric has been in business for over 140
years, investing $5.4B annually in R&D (6% of revenue)
• Augmenting our operational technology depth with digital
• Focus on industrial Internet of Things, and creating
insights based on business and machine data
• We knew we needed best-in-class partners to let us focus
on what we do best; GE is migrating 9,000 workloads into
AWS
Our Challenge
• Raise the bar on data warehouse scalability,
integration, stability, and development velocity
• Needed scalability for machine and business data, as GE
increasingly digitizes
• Self-serve BI strategy meant we had to maintain our current
compute capabilities
• Increasingly critical dependencies require rock-solid platform
• Desired more intuitive and accessible analytics solution
Technical Journey
• Had an on-premises, in-memory columnar database
• Good, but hard to scale
• We’re getting out of the business of managing infrastructure
• Selected Amazon Redshift to replace the on-premises solution
• Fully managed & scalable
• Familiar SQL technology
• Our previous ETL technology was a traditional enterprise ETL platform
• Tried using it for our Amazon Redshift project, but it wasn’t working for us
• Lots of manual SQL coding required
• Deployment, management, scaling, etc. were hard, as it wasn’t cloud 1st
• Wanted a “Cloud 1st” solution, ideally built for Amazon Redshift
• An AWS SA recommended we look at Matillion
• AWS Marketplace allowed us to PoC Matillion quickly and cost effectively
Solution Architecture
Source data: SAP
Data Warehouse:
• Amazon Redshift
Data Integration:
• Matillion ETL for Amazon Redshift
• HVR
Data Visualisation: Tableau
[Architecture diagram: SAP → CDC Data Replication (HVR) → Amazon Redshift Cluster (32 x DC1 Nodes; Staging and DWH) → Tableau; Matillion ETL (M3.Large) performs ELT on the cluster]
ELT Approach
Amazon Redshift’s MPP columnar architecture
is very fast at transforming data
Matillion ETL uses a push-down architecture,
transforming (join/aggregate/filter/calculate etc)
data on the Amazon Redshift cluster directly
This simplifies our infrastructure and scaling
and the speed and proximity to the data helps
developer productivity
Same architecture can be achieved manually
(coding), but not as productively as with a tool
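To make the push-down idea concrete, here is a minimal sketch, not Matillion's actual generated SQL. It assumes Python's built-in sqlite3 as a stand-in for Amazon Redshift, and the table and column names (stg_orders, fact_open_orders) are hypothetical. The client only sends one SQL statement; the filter and aggregate run inside the database, so no source rows travel to the ETL server.

```python
import sqlite3

# sqlite3 is only a local stand-in for the warehouse; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_orders (order_id INTEGER, region TEXT, amount REAL, status TEXT);
    INSERT INTO stg_orders VALUES
        (1, 'EU', 100.0, 'OPEN'),
        (2, 'EU',  50.0, 'CANCELLED'),
        (3, 'US',  75.0, 'OPEN'),
        (4, 'US',  25.0, 'OPEN');
""")

# ELT: the transform (filter + aggregate) is expressed as SQL and executed
# where the data lives, instead of fetching rows and transforming them client-side.
conn.execute("""
    CREATE TABLE fact_open_orders AS
    SELECT region,
           COUNT(*)    AS order_count,
           SUM(amount) AS total_amount
    FROM stg_orders
    WHERE status <> 'CANCELLED'
    GROUP BY region
""")

# Only the small, already-transformed result is read back.
result = conn.execute(
    "SELECT region, order_count, total_amount FROM fact_open_orders ORDER BY region"
).fetchall()
print(result)
```

On a real cluster the same statement would be sent over a Redshift connection; the structure of the job stays the same.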
Transform goals
Solution Details – the Transforms
One of our transformation jobs
• Denormalise complex underlying
SAP data structures → analysis
ready Facts and Dimensions
• “Clean up” data, ironing out, for
instance, differences between
configurations in business
units/geographies
• Add metrics, KPIs and measures
for business to consume (and do
so consistently)
Transform goals (cont.)
Solution Details – the Transforms
Transform detail
• Use of ETL tool required to manage
the complexity of the transforms
• Graphical jobs help document,
understand and share the business
logic
• Full range of transforms used, e.g.,
join, aggregate, filter, calculate,
rename, convert type, map values,
rank, etc.
• ELT architecture means you can see
data live in the job at each component,
significantly streamlining development
Solution Details – Volumes & KPIs
• Approx. 4 TB across 32 x Amazon
Redshift compute intensive nodes
• Transaction business data from
SAP
• Financial transactions, inventory
movements, sales orders, etc.
• Data is staged to Amazon Redshift
via CDC (HVR)
• Data is then micro-batched into the
warehouse: transformed, then
upserted using the ‘Table Update’
component (set to the update/insert
strategy) in Matillion ETL
• Under the hood, this is doing
an insert for any new rows,
and an update for any existing
rows
• Runs every hour using schedule in
Matillion scheduler
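The update/insert strategy described above can be sketched as two statements in one transaction. This is a common hand-written pattern, not necessarily the SQL that the Table Update component emits. sqlite3 stands in for Amazon Redshift, and the tables (stg_inventory, dwh_inventory) are hypothetical.

```python
import sqlite3

# sqlite3 is a local stand-in for the warehouse; table names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dwh_inventory (item_id INTEGER PRIMARY KEY, qty INTEGER);
    CREATE TABLE stg_inventory (item_id INTEGER, qty INTEGER);
    INSERT INTO dwh_inventory VALUES (1, 10), (2, 20);
    INSERT INTO stg_inventory VALUES (2, 25), (3, 30);
""")

def upsert(conn):
    """Update rows that already exist in the target, then insert the new ones."""
    with conn:  # commits both statements together, or rolls both back on error
        conn.execute("""
            UPDATE dwh_inventory
            SET qty = (SELECT s.qty FROM stg_inventory s
                       WHERE s.item_id = dwh_inventory.item_id)
            WHERE item_id IN (SELECT item_id FROM stg_inventory)
        """)
        conn.execute("""
            INSERT INTO dwh_inventory (item_id, qty)
            SELECT s.item_id, s.qty
            FROM stg_inventory s
            WHERE NOT EXISTS (SELECT 1 FROM dwh_inventory d
                              WHERE d.item_id = s.item_id)
        """)

upsert(conn)
rows = conn.execute(
    "SELECT item_id, qty FROM dwh_inventory ORDER BY item_id"
).fetchall()
print(rows)
```

Item 1 is untouched, item 2 is updated from staging, and item 3 is inserted as a new row.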
Deep Dive:
Hands On Demonstration
Lessons learned and tips
Some of the jobs require very large, complex joins across
billions of rows
• In a Matillion job, each component builds an underlying Amazon
Redshift view. Amazon Redshift optimises the collection of views
into a performant execution plan
• To improve performance on the largest joins, we split the jobs up
into sub-jobs, outputting to a table between each sub-job
• Distribution and sort keys of these intermediary tables were set to
optimise the subsequent join (using the Table Output component in
Matillion)
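As a rough illustration of materialising an intermediate table between sub-jobs, the helper below builds an Amazon Redshift CREATE TABLE AS statement with an explicit distribution key and sort key. The helper name, table names, and query are hypothetical, and the sketch assumes the next sub-job joins on order_id. It only builds the SQL string and does no identifier validation.

```python
def build_intermediate_ctas(table, select_sql, dist_key, sort_keys):
    """Build a Redshift CTAS statement with explicit DISTKEY and SORTKEY.

    Hypothetical helper: identifiers are assumed to be trusted and are not quoted.
    """
    sort_cols = ", ".join(sort_keys)
    return (
        f"CREATE TABLE {table}\n"
        f"DISTKEY ({dist_key})\n"
        f"SORTKEY ({sort_cols})\n"
        f"AS\n"
        f"{select_sql.strip()};"
    )

# Output of the first sub-job, distributed and sorted on the column that the
# (assumed) next sub-job joins on, so matching rows sit on the same slice.
ddl = build_intermediate_ctas(
    table="tmp_orders_enriched",
    select_sql="""
        SELECT o.order_id, o.customer_id, c.region
        FROM stg_orders o
        JOIN stg_customers c ON c.customer_id = o.customer_id
    """,
    dist_key="order_id",
    sort_keys=["order_id"],
)
print(ddl)
```

Choosing the distribution key of the intermediate table to match the join column of the following step is what lets the later join avoid redistributing rows across nodes.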
Lessons learned and tips
We record the time of the last
successful upload in a state table
in Amazon Redshift
• The time of the last transform is
read from the table
• Only data from after that time is
processed, then the table is
updated
• This is wrapped in transaction
control, using the Matillion
transaction control orchestration
components
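The state-table pattern above can be sketched as follows. sqlite3 stands in for Amazon Redshift, and the tables (etl_state, stg_sales, dwh_sales) and the changed_at column are hypothetical. One assumption goes beyond the slide: the new watermark is taken as the latest changed_at value that was processed, rather than the wall-clock time of the run.

```python
import sqlite3

# sqlite3 is a local stand-in for the warehouse; all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE etl_state (job_name TEXT PRIMARY KEY, last_success TEXT);
    CREATE TABLE stg_sales (sale_id INTEGER, changed_at TEXT);
    CREATE TABLE dwh_sales (sale_id INTEGER, changed_at TEXT);
    INSERT INTO etl_state VALUES ('sales_load', '2016-11-29T10:00:00');
    INSERT INTO stg_sales VALUES
        (1, '2016-11-29T09:30:00'),
        (2, '2016-11-29T10:15:00'),
        (3, '2016-11-29T10:45:00');
""")

def run_incremental(conn, job_name):
    """Load rows newer than the stored watermark, then advance the watermark.

    Returns the new watermark, or None when there was nothing to load.
    """
    with conn:  # load and state update commit together, or not at all
        (last_success,) = conn.execute(
            "SELECT last_success FROM etl_state WHERE job_name = ?", (job_name,)
        ).fetchone()
        # Assumption: the watermark is the newest change timestamp processed.
        new_mark = conn.execute(
            "SELECT MAX(changed_at) FROM stg_sales WHERE changed_at > ?",
            (last_success,),
        ).fetchone()[0]
        if new_mark is None:
            return None
        conn.execute(
            "INSERT INTO dwh_sales (sale_id, changed_at) "
            "SELECT sale_id, changed_at FROM stg_sales "
            "WHERE changed_at > ? AND changed_at <= ?",
            (last_success, new_mark),
        )
        conn.execute(
            "UPDATE etl_state SET last_success = ? WHERE job_name = ?",
            (new_mark, job_name),
        )
        return new_mark

first_run = run_incremental(conn, "sales_load")
second_run = run_incremental(conn, "sales_load")
loaded = conn.execute("SELECT sale_id FROM dwh_sales ORDER BY sale_id").fetchall()
print(first_run, second_run, loaded)
```

Because the load and the state update share one transaction, a failed run leaves the watermark unchanged and the next run picks up the same rows again.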
SAP uses four character names
for fields and tables
• We use the rename column
component in Matillion, in “Text
Mode”, so we can copy and
paste a spreadsheet of mappings
from the 4-char names to
human-readable names easily
• A useful hack which saved hours
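For readers doing the same thing by hand, a small sketch of turning a pasted mapping into a renaming SELECT follows. It assumes the spreadsheet pastes as tab-separated text with one mapping per line; it is not the exact format that Matillion's Text Mode expects. The 4-char codes, target names, and table name are illustrative placeholders.

```python
import csv
import io

def parse_mappings(pasted_text):
    """Parse tab-separated 'source<TAB>target' lines copied from a spreadsheet."""
    reader = csv.reader(io.StringIO(pasted_text), delimiter="\t")
    return [(row[0].strip(), row[1].strip()) for row in reader if len(row) >= 2]

def build_rename_select(table, mappings):
    """Build a SELECT that aliases each source column to its readable name.

    Identifiers are assumed to be trusted; no quoting or validation is done.
    """
    cols = ",\n    ".join(f"{src} AS {dst}" for src, dst in mappings)
    return f"SELECT\n    {cols}\nFROM {table}"

# Placeholder mapping; the codes and names are made up for illustration.
pasted = "MATN\tmaterial_number\nWERK\tplant\nMENG\tquantity\n"
mappings = parse_mappings(pasted)
sql = build_rename_select("stg_abcd", mappings)
print(sql)
```

Keeping the mapping in a spreadsheet means the business-facing names can be reviewed by non-developers before they are applied.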
Lessons learned - AWS Marketplace
AWS Marketplace genuinely added value in
both PoC/procurement and also
architecturally
• Took away any security issues as it’s an AMI
running in our own VPC
• Same goes for performance – no data moving
across the internet
• Allowed us to try out several tools quickly and
pick the best fit
• Supported GE’s “fail fast” ethos
• Made purchasing and therefore project launch
faster and simpler – no small thing in a
company the size of GE
Outcomes
Key Numbers
45% operating cost reduction
6 months from ideation to go-live
<$100 PoC pilot cost
Outcomes for GE
• Preserved performance and added stability
• Simplified operations with managed Amazon
Redshift solution
• OPEX savings. No CAPEX required
• Fast development from concept to reality
• Highly resilient, seamless scaling
Q & A
Resources
• Free 14-day trial:
• https://aws.amazon.com/marketplace/pp/B010ED5YF8
• Tutorials:
• https://www.youtube.com/user/MatillionVideos
• Support And Documentation:
• https://redshiftsupport.matillion.com
• Web:
• https://www.matillion.com
• Visit booth 2338 for a hands-on demo and AWS credits to support your
trial/PoC
Thank you!
Remember to complete
your evaluations!