Uses and Best Practices for Amazon Redshift

Transcript
Page 1: Uses and Best Practices for Amazon Redshift

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Uses & Best Practices

for Amazon Redshift

Rahul Pathak, AWS (rahulpathak@)

Daniel Mintz, Upworthy (danielmintz@)

July 10, 2014

Page 2: Uses and Best Practices for Amazon Redshift

Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year

Amazon Redshift

Page 3: Uses and Best Practices for Amazon Redshift

[Diagram: AWS services grouped by stage — Collect: Amazon Kinesis, AWS Direct Connect; Store: Amazon S3, Amazon Glacier, Amazon DynamoDB; Analyze: Amazon Redshift, Amazon EMR, Amazon EC2]

Page 4: Uses and Best Practices for Amazon Redshift

Petabyte scale

Massively parallel

Relational data warehouse

Fully managed; zero admin

Amazon Redshift

a lot faster

a lot cheaper

a whole lot simpler

Page 5: Uses and Best Practices for Amazon Redshift

Common Customer Use Cases

Traditional Enterprise DW:

• Reduce costs by extending DW rather than adding HW

• Migrate completely from existing DW systems

• Respond faster to business

Companies with Big Data:

• Improve performance by an order of magnitude

• Make more data available for analysis

• Access business data via standard reporting tools

SaaS Companies:

• Add analytic functionality to applications

• Scale DW capacity as demand grows

• Reduce HW & SW costs by an order of magnitude

Page 6: Uses and Best Practices for Amazon Redshift

Amazon Redshift Customers

Page 7: Uses and Best Practices for Amazon Redshift

Growing Ecosystem

Page 8: Uses and Best Practices for Amazon Redshift

AWS Marketplace

• Find software to use with Amazon Redshift

• One-click deployments

• Flexible pricing options

http://aws.amazon.com/marketplace/redshift

Page 9: Uses and Best Practices for Amazon Redshift

Data Loading Options

• Parallel upload to Amazon S3

• AWS Direct Connect

• AWS Import/Export

• Amazon Kinesis

• Systems integrators

[Partner logos: Data Integration; Systems Integrators]

Page 10: Uses and Best Practices for Amazon Redshift

Amazon Redshift Architecture

• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution

• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH

• Two hardware platforms
– Optimized for data processing
– DW1: HDD; scale from 2TB to 1.6PB
– DW2: SSD; scale from 160GB to 256TB

[Diagram: clients connect to the leader node via JDBC/ODBC; compute nodes interconnect over 10 GigE (HPC); ingestion, backup, and restore flow through Amazon S3]

Page 11: Uses and Best Practices for Amazon Redshift

Amazon Redshift Node Types

DW1 (HDD):

• Optimized for I/O-intensive workloads

• High disk density

• On-demand at $0.85/hour

• As low as $1,000/TB/Year

• Scale from 2TB to 1.6PB

• DW1.XL: 16 GB RAM, 2 Cores, 3 Spindles, 2 TB compressed storage

• DW1.8XL: 128 GB RAM, 16 Cores, 24 Spindles, 16 TB compressed storage, 2 GB/sec scan rate

DW2 (SSD):

• High performance at smaller storage size

• High compute and memory density

• On-demand at $0.25/hour

• As low as $5,500/TB/Year

• Scale from 160GB to 256TB

• DW2.L *New*: 16 GB RAM, 2 Cores, 160 GB compressed SSD storage

• DW2.8XL *New*: 256 GB RAM, 32 Cores, 2.56 TB compressed SSD storage

Page 12: Uses and Best Practices for Amazon Redshift

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• With row storage you do unnecessary I/O

• To get the total Amount, you have to read everything

ID  | Age | State | Amount
----+-----+-------+-------
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375

Page 13: Uses and Best Practices for Amazon Redshift

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• With column storage, you only read the data you need

ID  | Age | State | Amount
----+-----+-------+-------
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375

Page 14: Uses and Best Practices for Amazon Redshift

analyze compression listing;

Table | Column | Encoding

---------+----------------+----------

listing | listid | delta

listing | sellerid | delta32k

listing | eventid | delta32k

listing | dateid | bytedict

listing | numtickets | bytedict

listing | priceperticket | delta32k

listing | totalprice | mostly32

listing | listtime | raw

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• COPY compresses automatically

• You can analyze and override

• More performance, less cost
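Those suggested encodings can then be applied explicitly in the DDL. A minimal sketch, assuming the TICKIT-style listing table from the output above (the column types are assumptions, not from the slide):

create table listing (
    listid         integer       encode delta,
    sellerid       integer       encode delta32k,
    eventid        integer       encode delta32k,
    dateid         smallint      encode bytedict,
    numtickets     smallint      encode bytedict,
    priceperticket decimal(8,2)  encode delta32k,
    totalprice     decimal(8,2)  encode mostly32,
    listtime       timestamp     encode raw
);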

Page 15: Uses and Best Practices for Amazon Redshift

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

[Diagram: a sorted column laid out in blocks (10…324, 375…623, 637…959), with the zone map recording each block's min and max: (10, 324), (375, 623), (637, 959)]

• Track the minimum and maximum value for each block

• Skip over blocks that don’t contain relevant data
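Zone maps pay off most when the table is sorted on the filter column. A minimal sketch (table and column names are hypothetical, not from the talk): with rows sorted by event_time, a date-range predicate lets the scan skip every block whose min/max falls outside the range.

create table events (
    event_time timestamp,
    user_id    bigint,
    event_type varchar(32)
)
sortkey (event_time);

-- Only blocks whose [min, max] range overlaps June 2014 get read from disk.
select count(*)
from events
where event_time between '2014-06-01' and '2014-06-30';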

Page 16: Uses and Best Practices for Amazon Redshift

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Use local storage for performance

• Maximize scan rates

• Automatic replication and continuous backup

• HDD & SSD platforms

Page 17: Uses and Best Practices for Amazon Redshift

Amazon Redshift parallelizes and distributes everything

• Query

• Load

• Backup/Restore

• Resize

Page 18: Uses and Best Practices for Amazon Redshift

• Load in parallel from Amazon S3, Amazon EMR, Amazon DynamoDB, or any SSH connection

• Data automatically distributed and sorted according to DDL

• Scales linearly with number of nodes

Amazon Redshift parallelizes and distributes everything

• Query

• Load

• Backup/Restore

• Resize
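For the Amazon S3 path, the load is a single COPY, and splitting the input into multiple files is what makes it scale across slices. A minimal sketch (bucket, prefix, and credential placeholders are hypothetical):

copy events
from 's3://my-bucket/events/part'
credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
gzip
delimiter '|';

Every file matching the part prefix is loaded in parallel across the compute node slices.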

Page 19: Uses and Best Practices for Amazon Redshift

• Backups to Amazon S3 are automatic, continuous, and incremental

• Configurable system snapshot retention period. Take user snapshots on-demand

• Cross-region backups for disaster recovery

• Streaming restores enable you to resume querying faster

Amazon Redshift parallelizes and distributes everything

• Query

• Load

• Backup/Restore

• Resize

Page 20: Uses and Best Practices for Amazon Redshift

• Resize while remaining online

• Provision a new cluster in the background

• Copy data in parallel from node to node

• Only charged for source cluster

Amazon Redshift parallelizes and distributes everything

• Query

• Load

• Backup/Restore

• Resize

Page 21: Uses and Best Practices for Amazon Redshift

• Automatic SQL endpoint switchover via DNS

• Decommission the source cluster

• Simple operation via Console or API

Amazon Redshift parallelizes and distributes everything

• Query

• Load

• Backup/Restore

• Resize

Page 22: Uses and Best Practices for Amazon Redshift

Amazon Redshift is priced to let you analyze all your data

• Number of nodes x cost per hour

• No charge for leader node

• No upfront costs

• Pay as you go

DW1 (HDD)          | Price Per Hour for DW1.XL Single Node | Effective Annual Price per TB
-------------------+---------------------------------------+------------------------------
On-Demand          | $0.850                                | $3,723
1 Year Reservation | $0.500                                | $2,190
3 Year Reservation | $0.228                                | $999

DW2 (SSD)          | Price Per Hour for DW2.L Single Node  | Effective Annual Price per TB
-------------------+---------------------------------------+------------------------------
On-Demand          | $0.250                                | $13,688
1 Year Reservation | $0.161                                | $8,794
3 Year Reservation | $0.100                                | $5,498

Page 23: Uses and Best Practices for Amazon Redshift

Amazon Redshift is easy to use

• Provision in minutes

• Monitor query performance

• Point and click resize

• Built in security

• Automatic backups

Page 24: Uses and Best Practices for Amazon Redshift

Amazon Redshift has security built-in

• SSL to secure data in transit

• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disks and in Amazon S3 encrypted
– HSM support so you control keys

• Audit logging & AWS CloudTrail integration

• Amazon VPC support

• SOC 1/2/3, PCI DSS Level 1, FedRAMP, and more

[Diagram: clients connect via JDBC/ODBC into the customer VPC; the cluster's compute nodes sit in an internal VPC on 10 GigE (HPC); ingestion, backup, and restore flow through Amazon S3]

Page 25: Uses and Best Practices for Amazon Redshift

Amazon Redshift continuously backs up your data and recovers from failures

• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times

• Backups to Amazon S3 are continuous, automatic, and incremental

– Designed for eleven nines of durability

• Continuous monitoring and automated recovery from failures of drives and nodes

• Able to restore snapshots to any Availability Zone within a region

• Easily enable backups to a second region for disaster recovery

Page 26: Uses and Best Practices for Amazon Redshift

60+ new features since launch

• Regions – N. Virginia, Oregon, Dublin, Tokyo, Singapore, Sydney

• Certifications – PCI, SOC 1/2/3, FedRAMP, PCI-DSS Level 1, others

• Security – Load/unload encrypted files, resource-level IAM, temporary credentials, HSM, ECDHE for perfect forward secrecy

• Manageability – Snapshot sharing, backup/restore/resize progress indicators, cross-region backups

• Query – Regex, cursors, MD5, SHA1, time zone, workload queue timeout, HLL, concurrency to 50

• Ingestion – S3 manifest, LZOP/LZO, JSON built-ins, UTF-8 4-byte, invalid character substitution, CSV, auto datetime format detection, epoch, ingest from SSH, JSON, EMR

Page 27: Uses and Best Practices for Amazon Redshift

Amazon Redshift Feature Delivery

• Service Launch (2/14)
• PDX (4/2)
• Temp Credentials (4/11)
• DUB (4/25)
• SOC1/2/3 (5/8)
• Unload Encrypted Files
• NRT (6/5)
• JDBC Fetch Size (6/27)
• Unload logs (7/5)
• SHA1 Builtin (7/15)
• 4-byte UTF-8 (7/18)
• Sharing snapshots (7/18)
• Statement Timeout (7/22)
• Timezone, Epoch, Autoformat (7/25)
• WLM Timeout/Wildcards (8/1)
• CRC32 Builtin, CSV, Restore Progress (8/9)
• Resource-Level IAM (8/9)
• PCI (8/22)
• UTF-8 Substitution (8/29)
• JSON, Regex, Cursors (9/10)
• Split_part, Audit tables (10/3)
• SIN/SYD (10/8)
• HSM Support (11/11)
• Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross-Region Backup (11/13)
• Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13)
• EIP Support for VPC Clusters (12/28)
• New query monitoring system tables and diststyle all (1/13)
• Redshift on DW2 (SSD) Nodes (1/23)
• Compression for COPY from SSH, fetch size support for single-node clusters, new system tables with commit stats, row_number(), strtol(), and query termination (2/13)
• Resize progress indicator & Cluster Version (3/21)
• Regex_Substr, COPY from JSON (3/25)
• 50 slots, COPY from EMR, ECDHE ciphers (4/22)
• 3 new regex features, Unload to single file, FedRAMP (5/6)
• Rename Cluster (6/2)
• Copy from multiple regions, percentile_cont, percentile_disc (6/30)
• Free Trial (7/1)

Page 28: Uses and Best Practices for Amazon Redshift

New Features

• UNLOAD to single file

• COPY from multiple regions

• Percentile_cont & percentile_disc window functions

Page 29: Uses and Best Practices for Amazon Redshift

Try Amazon Redshift with BI & ETL for Free!

• http://aws.amazon.com/redshift/free-trial

• 2 months, 750 hours/month of free DW2.Large usage

• Also try BI & ETL for free from nine partners

Page 30: Uses and Best Practices for Amazon Redshift

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

A Year With Amazon Redshift

Why Upworthy Chose Redshift And How It’s Going

July 10, 2014

Page 31: Uses and Best Practices for Amazon Redshift

What’s Upworthy

• We’ve been called:
– “Social media with a mission” by our About Page
– “The fastest growing media site of all time” by Fast Company
– “The Fastest Rising Startup” by The Crunchies
– “That thing that’s all over my newsfeed” by my annoyed friends
– “The most data-driven media company in history” by me, optimistically

Page 32: Uses and Best Practices for Amazon Redshift

What We Do

• We aim to drive massive amounts of attention to things that really matter.

• We do that by finding, packaging, and distributing great, meaningful content.

Page 33: Uses and Best Practices for Amazon Redshift

Our Use Case

Page 34: Uses and Best Practices for Amazon Redshift

When We Started

• Building a data warehouse from scratch

• One engineer on the project

• Object data in MongoDB

• Had discovered MoSQL

• Knew which two we’d choose:

– Comprehensive

– Ad Hoc

– Real-Time

Page 35: Uses and Best Practices for Amazon Redshift

The Decision

• Downsides to self-hosting

– Cost

– Maintenance

• Tried Redshift by the hour

Page 36: Uses and Best Practices for Amazon Redshift

Building it out initially

• ~50 events/second

• Snowflake or denormalized? Both.

[Diagram: raw events flow Browser → Web Service → S3 Drain → Amazon S3 → EMR → processed events → Redshift; object data flows MongoDB (Master) → MoSQL → PostgreSQL → objects → Redshift]

Page 37: Uses and Best Practices for Amazon Redshift

Our system now

• Stats:

– ~5 TB of compressed data

– Two main tables = 13 billion rows

– Average: ~1085 events/second

– Peak: ~2500 events/second

• 5-10 minute ETL cycle (Kinesis session later)

• Lots of rollup tables

• COPY happens quickly (5-10s) every 1-2 minutes

Page 38: Uses and Best Practices for Amazon Redshift

What We’ve Learned

Page 39: Uses and Best Practices for Amazon Redshift

Initial Lessons

• Columnar can be disorienting:

– What’s in the SELECT matters. A lot.

– SELECT COUNT(*) is really, really fast.

– SELECT * is really, really slow.

– Wide, tall tables with lots of NULLs = Fine.

• Sortkeys are hugely powerful.

• Bulk Operations (COPY, not INSERT) FTW
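The COPY-versus-INSERT gap is worth showing concretely. A minimal sketch (table, bucket, and credential placeholders are hypothetical): each single-row INSERT pays its own round-trip and commit overhead, while one COPY bulk-loads the same rows in parallel.

-- Slow: one round trip per row
insert into events values (1, '2014-07-10 12:00:00', 'click');
insert into events values (2, '2014-07-10 12:00:01', 'share');
-- ...repeated per row

-- Fast: one parallel bulk load
copy events
from 's3://my-bucket/events/2014-07-10/'
credentials 'aws_access_key_id=<key>;aws_secret_access_key=<secret>';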

Page 40: Uses and Best Practices for Amazon Redshift

Staging to Master Pattern

INSERT INTO master
WITH to_insert AS (
    SELECT
        col1,
        col2,
        col3
    FROM
        staging
)
SELECT
    s.*
FROM
    to_insert s
    LEFT JOIN master m ON m.col1 = s.col1
WHERE
    m.col1 IS NULL;
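In practice this pattern usually runs inside a transaction so readers never see a half-loaded batch, with the staging table cleared afterwards. A sketch, assuming the same table names as above:

BEGIN;

INSERT INTO master
WITH to_insert AS (SELECT col1, col2, col3 FROM staging)
SELECT s.*
FROM to_insert s
LEFT JOIN master m ON m.col1 = s.col1
WHERE m.col1 IS NULL;

-- DELETE rather than TRUNCATE here: TRUNCATE implicitly commits in Redshift
DELETE FROM staging;

COMMIT;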

Page 41: Uses and Best Practices for Amazon Redshift

Hash multiple join keys into one

...
FROM
    table1 t1
    LEFT JOIN table2 t2 ON
        t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3

becomes:

md5(col1 || col2 || col3) AS hashed_join_key
...
FROM
    table1 t1
    LEFT JOIN table2 t2 ON t1.hashed_join_key = t2.hashed_join_key
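One caveat worth adding (not from the talk): || yields NULL if any operand is NULL, and adjacent values can collide ('ab' || 'c' equals 'a' || 'bc'), so a delimited, null-safe variant is safer:

md5(NVL(col1::varchar, '') || '|' || NVL(col2::varchar, '') || '|' || NVL(col3::varchar, '')) AS hashed_join_key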

Page 42: Uses and Best Practices for Amazon Redshift

Pain Points

• Data loading can be a bit tricky at first
– Actually read the documentation. It’ll help.

• COUNT(DISTINCT col1) can be painful
– Use APPROXIMATE COUNT(DISTINCT col1) instead

• No null-safe operator. NULL = NULL returns NULL
– NVL(t1.col1, 0) = NVL(t2.col1, 0) is null-safe

• Error messages can be less than informative
– The AWS Redshift Forum is fantastic. Someone else has probably seen that error before you.
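Both query-side workarounds in one place, as a minimal sketch (table and column names are hypothetical):

-- HyperLogLog-based approximation; far cheaper than the exact distinct count
select approximate count(distinct user_id) from events;

-- Null-safe join: NULL = NULL evaluates to NULL, so coalesce both sides
select count(*)
from table1 t1
left join table2 t2 on NVL(t1.col1, 0) = NVL(t2.col1, 0);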

Page 43: Uses and Best Practices for Amazon Redshift

Where We’re Going Next

Page 44: Uses and Best Practices for Amazon Redshift

In The Next Few Months

• More structure to our ETL (implementing luigi)

• Need to avoid Serializable Isolation Violations

• Dozens of people querying a cluster of 4 dw1.xlarge

nodes can get ugly

• “Write” node of dw1 nodes and “Read” cluster of dw2

“dense compute” nodes

Page 45: Uses and Best Practices for Amazon Redshift

Thank You!