Scaling AWS Redshift Concurrency with Postgres€¦ · Scaling AWS Redshift Concurrency with...
Transcript of Scaling AWS Redshift Concurrency with Postgres€¦ · Scaling AWS Redshift Concurrency with...
Scaling AWS Redshift Concurrency with PostgresBy Elliott Cordo, Will Liu, Paul Singman
Integrated luxury and lifestyle company with offerings centered on movement, nutrition, and regeneration
we operate more than 200 locations within every major city across the country in addition to London and Canada
Analytics Overview
1. Extract data from source systems
2. Transform raw datainto useful metrics
3. Analyze, report, andvisualize
ETL
Data Warehouse Reporting & Analytics AppsThird Party IntegrationsML Modeling & Insights
DB DB
CRM Clickstream
ERP API
Finance FTP
Why A Data Warehouse?
• Prevent data discrepancies
• Lower employee learning curve
• Avoid duplicating logic in multiple systems
• Isolate production DBs from analytic workloads
Presents a single point of failure so we created a data replication failover procedure
Data Ecosystem @ Equinox
Maximilian
1. Extract data from source systems
Data Ecosystem @ Equinox
AWS Redshift
2. Transform raw datainto useful metrics
Maximilian
Redshift Warehouse Structure (J.A.R.V.I.S.)
SQL
Raw Landing Tables
Big piles of SQL
Fact & Dimension Tables
d_facility
d_membership
f_checkin
Smaller piles of SQL
SQLSQLSQL
Data Marts
Member Profile
Group Fitness
Retail
Data Ecosystem @ Equinox
AWS Redshift
2. Transform raw datainto useful metrics
Maximilian
Data Ecosystem @ Equinox
Equinox AppsReporting/DashboardsRecommender SystemsInternal APIsAd-hoc Analysis3rd Party Data Integrations
3. Analyze, report, andvisualize
Maximilian AWS Redshift
Data Ecosystem @ Equinox
Equinox AppsReporting/DashboardsRecommender SystemsInternal APIsAd-hoc Analysis3rd Party Data Integrations
AWS RedshiftMaximilian
Data Ecosystem @ Equinox
Equinox AppsReporting/DashboardsRecommender SystemsInternal APIsAd-hoc analysis3rd party data integrations
AWS RedshiftMaximilian
Key Features of Redshift
Key Features of Redshift
• Released by AWS in early 2013, based on Postgres v8.0.2
Key Features of Redshift
• Column-oriented storage
Key Features of Redshift
• Immutable 1MB block storage
Key Features of Redshift
• Massively-parallel processing compute engine
Key Features of Redshift
• Native multi-node (leader + workers) architecture
Key Features of Redshift
• Distribution key & sort key table settings
Key Features of Redshift
• Workload Management queue settings
Key Features of Redshift
• Released by AWS in early 2013, based on Postgres v8.0.2
• Column-oriented storage
• Immutable 1MB block storage
• Massively-parallel processing compute engine
• Native multi-node (leader + workers) architecture
• Distribution key & sort key table settings
• Workload Management queue settings
• Petabyte-scale disk storage
• Batch insertions & retrieval
• Complex computations
• High-frequency transactions
• Concurrent user connections
• Petabyte-scale disk storage
• Batch insertions & retrieval
• Complex computations
10x - 100x slower on simple SELECT500 connection limit (per cluster)50 connection limit (per user-defined queue)
• High-frequency transactions
• Concurrent user connections
• Petabyte-scale disk storage
• Batch insertions & retrieval
• Complex computations
Warehouse Consumers
AWS Redshift
Equinox Apps
Recommender System
Internal APIs
Ad-hoc Analytics
Reporting/Dashboards
Warehouse Consumers
AWS Redshift
Reporting/Dashboards *
Problem
AWS Redshift
Sales Reporting Architecture
Moso
dbo.agreements
dbo.sales
raw_agreements
raw_ sales
~10 min
membership_sales
AWS Redshift Moso
dbo.agreements
dbo.sales
raw_agreements
raw_ sales
~10 min
membership_sales
Sales Reporting Architecture Problem
10x - 100x slower on simple SELECT500 connection limit (per cluster)50 connection limit (per user-defined queue)
Potential Solutions
Potential Solutions
1. Pull from source DB
AWS Redshift
Sales Reporting Architecture
Moso
dbo.agreements
dbo.sales
raw_agreements
raw_ sales
~10 min
membership_sales
Pros
• Source DB is OLTP
Cons
Potential Solutions
1. Pull from source DB
• Sales logic is complex!• Burdens prod DB
Pros
Cons
2. Cache in Redis
Potential Solutions
1. Pull from source DB
• Sales logic is complex!• Burdens prod DB
• Source DB is OLTP
AWS Redshift
Sales Reporting Architecture
raw_agreements
raw_ sales
membership_sales
Pros
Cons
2. Cache in Redis
Pros
• Very fast performance
Cons
Potential Solutions
1. Pull from source DB
• Sales logic is complex!• Burdens prod DB
• Non-relational data structure• Keys must be created for
every data view
• Source DB is OLTP
Pros
Cons
Pros
• Very fast performance
Cons
Potential Solutions
2. Cache in Redis1. Pull from source DB 3. Copy to another DB
• Sales logic is complex!• Burdens prod DB
• Non-relational data structure• Keys must be created for
every data view
• Source DB is OLTP
AWS Redshift
membership_sales
Sales Reporting Architecture
raw_agreements
raw_ sales
1. Pull from source DB
Pros
Cons• Sales logic is complex!• Burdens prod DB
Pros• Very fast performance
Cons• Non-relational data structure• Keys must be created for
every data view
3. Copy to another DB
Pros
Cons• Additional ETL step
Potential Solutions
2. Cache in Redis
• Maintain relational format• Source DB is OLTP
1. Pull from source DB
Pros
Cons
Pros• Very fast performance
Cons
3. Copy to another DB
Pros
Cons
Potential Solutions
2. Cache in Redis
• Sales logic is complex!• Burdens prod DB
• Non-relational data structure• Keys must be created for
every data view
• Additional ETL step
• Maintain relational format• Source DB is OLTP
PostgreSQL Foreign Data Wrapper
PostgreSQL Foreign Data Wrapper
https://aws.amazon.com/blogs/big-data/join-amazon-redshift-and-amazon-rds-postgresql-with-dblink
AWS Redshift
Proposed Sales Reporting Architecture
raw_agreements
raw_ sales
PostgreSQL
membership_salesmembership_ sales_mv
AWS Redshift
Proposed Sales Reporting Architecture
raw_agreements
raw_ sales
PostgreSQL
membership_salesmembership_ sales_mv
Environment Set-up
✓Create Redshift cluster
✓Create PostgreSQL server (9.5+)
• RDS recommended
• For self-managed, install Postgres contrib package:• sudo yum install postgresql10-contrib.x86_64
✓Networking (AWS)
• Co-locate in same Availability Zone
• Configure Security Group
Creating the Link
--1 enable the required extensionsCREATE EXTENSION postgres_fdw;CREATE EXTENSION dblink;
--2 create the external serverCREATE SERVER jarvis
FOREIGN DATA WRAPPER postgres_fdwOPTIONS (host 'REDSHIFT_ENDPOINT', port '5439’,dbname 'REDSHIFT_DB_NAME', sslmode 'require');
--3 save redshift login to this external serverCREATE USER MAPPING FOR PG_USERNAME
SERVER Jarvis OPTIONS(user 'RS_USERNAME', password 'RS_PASSWORD’);
Running queries on PostgreSQL
SELECT *FROM dblink('jarvis', $REDSHIFT$
SELECTmember_sales_id,member_id,sales_action,sales_action_date
FROMrs_landing.raw_sales $REDSHIFT$)
AS sales_actions (member_sales_id varchar(50),member_id varchar(50),sales_action varchar(50),sales_action_date date
);
Leveraging a Materialized ViewCREATE MATERIALIZED VIEW pg.membership_sales_copy AS(
SELECT *FROM dblink('jarvis', $REDSHIFT$SELECT
member_sales_id,member_id,sales_action,sales_action_date
FROMrs.membership_sales $REDSHIFT$
) AS membership_sales_copy (member_sales_id varchar(50),member_id varchar(50),sales_action varchar(50),sales_action_date date
);
REFRESH materialized VIEW pg.membership_sales_copy;
AWS Redshift
membership_sales
Sales Reporting Architecture
raw_agreements
raw_ sales
dblink
PostgreSQL
membership_ sales_mv
Materialized View Roadblock
AWS Redshift
~10 mins
PostgreSQL
membership_sales_mv membership_sales
MillionsOf records
Change Data Capture for Large Tables
AWS Redshift
dblink
PostgreSQL
membership_sales
membership_sales_cdcmembership_sales_mv
membership_sales
--Step 1: Create staging table in Redshift with last few hours of sales actions--CREATE TABLE rs_landing.stage_sales_actionDELETE FROM rs.membership_sales_cdc
INSERT INTO rs.membership_sales_cdcSELECT member_sales_id, member_id, sales_action, sales_action_dateFROM rs.membership_salesWHERE date >= ' $[?from_date]';
--Step 2: Refresh materialized view in PostgresREFRESH materialized VIEW pg.membership_sales_mv;
--Step 3: Upsert logic to populate final table in Postgres from materialized view
--temp table to hold last batchDROP TABLE IF EXISTS cdc_sales;CREATE TEMP TABLE cdc_sales ASSELECT * FROM pg.membership_sales_mv;
--update changed records, member_sales_id as the key to identify a unique recordUPDATE pg.membership_sales msSET sa.member_id = s.member_id,
ms.sales_action = s.sales_action,ms.sales_action_date = s.sales_action_date
FROM cdc_sales sWHERE s.member_sales_id = ms.member_sales_id;
--delete the records we just updated from temp tableDELETE FROM cdc_sales s USING pg.membership_sales msWHERE s.member_sales_id = ms.member_sales_id;
--insert new records not found in membership_salesINSERT INTO pg.membership_salesSELECT * FROM cdc_sales;
--drop temp tableDROP TABLE cdc_sales;
Redshift
PostgreSQL
Solution
AWS Redshift
membership_sales_cdc
Sales Reporting Architecture
dblink
PostgreSQL
membership_salesmembership_sales_mvmembership_sales
AWS Redshift
membership_sales_cdc
Sales Reporting Architecture
dblink
PostgreSQL
membership_salesmembership_sales_mvmembership_sales
Solution
Learnings & Notes
• Minimal maintenance on Postgres instance
• Won’t reflect source deletions
• Limited to a few tables
• Flexible for schema evolution
Looking to the Future…
• Make use of foreign table
• Front-end scaling with read-replicas
• Extensible to other datastores
• Event-based streaming architecture
More to Explore
Q&A
We Are Hiring
Email [email protected]
• Head of Engineering
• React Native Engineer
• Sr. React Native Engineer
• API Engineer
• SDET Java - Architect
• SDET Javascript - Architect
• SDET Java
• SDET Javascript
• Sr. UX Researcher
Equinox Tech Blog http://tech.equinox.com/