Uses and Best Practices for Amazon Redshift


Description

In this session, you get an overview of Amazon Redshift, a fast, fully managed, petabyte-scale data warehouse service. We'll cover how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging from hundreds of gigabytes to a petabyte or more. We'll also discuss new features and architecture best practices, and share how customers are using Amazon Redshift for their Big Data workloads.

Transcript of Uses and Best Practices for Amazon Redshift

Page 1: Uses and Best Practices for Amazon Redshift

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Uses & Best Practices

for Amazon Redshift

Rahul Pathak, AWS (rahulpathak@)

Daniel Mintz, Upworthy (danielmintz@)

July 10, 2014

Page 2: Uses and Best Practices for Amazon Redshift

Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year

Amazon Redshift

Page 3: Uses and Best Practices for Amazon Redshift

Collect – Kinesis, Direct Connect
Store – S3, DynamoDB, Glacier
Analyze – Redshift, EMR, EC2

Page 4: Uses and Best Practices for Amazon Redshift

Amazon Redshift

Petabyte scale
Massively parallel
Relational data warehouse
Fully managed; zero admin

a lot faster
a lot cheaper
a whole lot simpler

Page 5: Uses and Best Practices for Amazon Redshift

Common Customer Use Cases

Traditional Enterprise DW
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business needs

Companies with Big Data
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools

SaaS Companies
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW & SW costs by an order of magnitude

Page 6: Uses and Best Practices for Amazon Redshift

Amazon Redshift Customers

Page 7: Uses and Best Practices for Amazon Redshift

Growing Ecosystem

Page 8: Uses and Best Practices for Amazon Redshift

AWS Marketplace

• Find software to use with Amazon Redshift

• One-click deployments

• Flexible pricing options

http://aws.amazon.com/marketplace/redshift

Page 9: Uses and Best Practices for Amazon Redshift

Data Loading Options

• Parallel upload to Amazon S3

• AWS Direct Connect

• AWS Import/Export

• Amazon Kinesis

• Systems integrators

Data Integration Systems Integrators

Page 10: Uses and Best Practices for Amazon Redshift

Amazon Redshift Architecture

• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution

• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH

• Two hardware platforms
– Optimized for data processing
– DW1: HDD; scale from 2TB to 1.6PB
– DW2: SSD; scale from 160GB to 256TB

[Diagram: JDBC/ODBC clients connect to the leader node; compute nodes interconnect over 10 GigE (HPC); ingestion, backup, and restore flow through Amazon S3]

Page 11: Uses and Best Practices for Amazon Redshift

Amazon Redshift Node Types

DW1 (HDD)
• Optimized for I/O-intensive workloads
• High disk density
• On demand at $0.85/hour
• As low as $1,000/TB/Year
• Scale from 2TB to 1.6PB
• DW1.XL: 16 GB RAM, 2 cores, 3 spindles, 2 TB compressed storage
• DW1.8XL: 128 GB RAM, 16 cores, 24 spindles, 16 TB compressed storage, 2 GB/sec scan rate

DW2 (SSD)
• High performance at smaller storage size
• High compute and memory density
• On demand at $0.25/hour
• As low as $5,500/TB/Year
• Scale from 160GB to 256TB
• DW2.L *New*: 16 GB RAM, 2 cores, 160 GB compressed SSD storage
• DW2.8XL *New*: 256 GB RAM, 32 cores, 2.56 TB of compressed SSD storage

Page 12: Uses and Best Practices for Amazon Redshift

Amazon Redshift dramatically reduces I/O

• Column storage
• Data compression
• Zone maps
• Direct-attached storage

• With row storage you do unnecessary I/O
• To get the total amount, you have to read everything

ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375

Page 13: Uses and Best Practices for Amazon Redshift

Amazon Redshift dramatically reduces I/O

• Column storage
• Data compression
• Zone maps
• Direct-attached storage

• With column storage, you only read the data you need

ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375

Page 14: Uses and Best Practices for Amazon Redshift

analyze compression listing;

Table | Column | Encoding

---------+----------------+----------

listing | listid | delta

listing | sellerid | delta32k

listing | eventid | delta32k

listing | dateid | bytedict

listing | numtickets | bytedict

listing | priceperticket | delta32k

listing | totalprice | mostly32

listing | listtime | raw

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• COPY compresses automatically

• You can analyze and override

• More performance, less cost
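The encodings that ANALYZE COMPRESSION reports can also be pinned explicitly in the DDL, overriding what COPY would choose. A sketch for the listing table above — the encodings mirror the slide's output, but the column types are assumed from the standard TICKIT sample schema, not stated in this deck:

```sql
-- Assumed DDL: encodings follow the ANALYZE COMPRESSION output above;
-- column types come from the TICKIT sample schema, not this deck.
CREATE TABLE listing (
    listid         INTEGER      ENCODE delta,
    sellerid       INTEGER      ENCODE delta32k,
    eventid        INTEGER      ENCODE delta32k,
    dateid         SMALLINT     ENCODE bytedict,
    numtickets     SMALLINT     ENCODE bytedict,
    priceperticket DECIMAL(8,2) ENCODE delta32k,
    totalprice     DECIMAL(8,2) ENCODE mostly32,
    listtime       TIMESTAMP    ENCODE raw
);
```

With explicit ENCODE clauses, COPY loads into these columns without re-deciding compression, which is useful once you've settled on encodings for a production table.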

Page 15: Uses and Best Practices for Amazon Redshift

Amazon Redshift dramatically reduces I/O

• Column storage
• Data compression
• Zone maps
• Direct-attached storage

• Track the minimum and maximum value for each block
• Skip over blocks that don’t contain relevant data

Block contents                            Min   Max
10 | 13 | 14 | 26 | … | 100 | 245 | 324   10    324
375 | 393 | 417 | … | 512 | 549 | 623     375   623
637 | 712 | 809 | … | 834 | 921 | 959     637   959
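Zone maps pay off when the table is physically sorted on the column you filter by. A hypothetical sketch — table, columns, and date range are all invented for illustration:

```sql
-- Hypothetical: with event_time as the sort key, blocks are physically
-- ordered by time, so min/max zone maps let Redshift skip every block
-- that falls outside the one-week window.
CREATE TABLE events (
    event_id   BIGINT,
    event_time TIMESTAMP,
    page       VARCHAR(256)
)
SORTKEY (event_time);

SELECT COUNT(*)
FROM events
WHERE event_time BETWEEN '2014-07-01' AND '2014-07-08';
```

Without the sort key, matching rows would be scattered across blocks and most blocks would still have to be read.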

Page 16: Uses and Best Practices for Amazon Redshift

Amazon Redshift dramatically reduces I/O

• Column storage

• Data compression

• Zone maps

• Direct-attached storage

• Use local storage for performance

• Maximize scan rates

• Automatic replication and continuous backup

• HDD & SSD platforms

Page 17: Uses and Best Practices for Amazon Redshift

Amazon Redshift parallelizes and distributes everything

• Query

• Load

• Backup/Restore

• Resize

Page 18: Uses and Best Practices for Amazon Redshift

• Load in parallel from Amazon S3, Amazon EMR, Amazon DynamoDB, or any SSH connection

• Data automatically distributed and sorted according to DDL

• Scales linearly with number of nodes

Amazon Redshift parallelizes and distributes everything

• Query

• Load

• Backup/Restore

• Resize

Page 19: Uses and Best Practices for Amazon Redshift

• Backups to Amazon S3 are automatic, continuous, and incremental

• Configurable system snapshot retention period; take user snapshots on demand

• Cross-region backups for disaster recovery

• Streaming restores enable you to resume querying faster

Amazon Redshift parallelizes and distributes everything

• Query

• Load

• Backup/Restore

• Resize

Page 20: Uses and Best Practices for Amazon Redshift

• Resize while remaining online

• Provision a new cluster in the background

• Copy data in parallel from node to node

• Only charged for source cluster

Amazon Redshift parallelizes and distributes everything

• Query

• Load

• Backup/Restore

• Resize

Page 21: Uses and Best Practices for Amazon Redshift

• Automatic SQL endpoint switchover via DNS

• Decommission the source cluster

• Simple operation via Console or API

Amazon Redshift parallelizes and distributes everything

• Query

• Load

• Backup/Restore

• Resize

Page 22: Uses and Best Practices for Amazon Redshift

Amazon Redshift is priced to let you analyze all your data

• Number of nodes x cost per hour

• No charge for leader node

• No upfront costs

• Pay as you go

DW1 (HDD)            Price per hour (DW1.XL single node)   Effective annual price per TB
On-Demand            $0.850                                $3,723
1 Year Reservation   $0.500                                $2,190
3 Year Reservation   $0.228                                $999

DW2 (SSD)            Price per hour (DW2.L single node)    Effective annual price per TB
On-Demand            $0.250                                $13,688
1 Year Reservation   $0.161                                $8,794
3 Year Reservation   $0.100                                $5,498

Page 23: Uses and Best Practices for Amazon Redshift

Amazon Redshift is easy to use

• Provision in minutes

• Monitor query performance

• Point and click resize

• Built in security

• Automatic backups

Page 24: Uses and Best Practices for Amazon Redshift

Amazon Redshift has security built-in

• SSL to secure data in transit

• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disks and in Amazon S3 encrypted
– HSM support so you control keys

• Audit logging & AWS CloudTrail integration

• Amazon VPC support

• SOC 1/2/3, PCI DSS Level 1, FedRAMP, and more

[Diagram: JDBC/ODBC clients reach the cluster inside a customer VPC; compute nodes sit in an internal VPC on 10 GigE (HPC); ingestion, backup, and restore via Amazon S3]

Page 25: Uses and Best Practices for Amazon Redshift

Amazon Redshift continuously backs up your data and recovers from failures

• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times

• Backups to Amazon S3 are continuous, automatic, and incremental

– Designed for eleven nines of durability

• Continuous monitoring and automated recovery from failures of drives and nodes

• Able to restore snapshots to any Availability Zone within a region

• Easily enable backups to a second region for disaster recovery

Page 26: Uses and Best Practices for Amazon Redshift

60+ new features since launch

• Regions – N. Virginia, Oregon, Dublin, Tokyo, Singapore, Sydney

• Certifications – PCI, SOC 1/2/3, FedRAMP, PCI-DSS Level 1, others

• Security – Load/unload encrypted files, resource-level IAM, temporary credentials, HSM, ECDHE for perfect forward secrecy

• Manageability – Snapshot sharing, backup/restore/resize progress indicators, cross-region backups

• Query – Regex, cursors, MD5, SHA1, time zone, workload queue timeout, HLL, concurrency to 50

• Ingestion – S3 manifest, LZOP/LZO, JSON built-ins, 4-byte UTF-8, invalid character substitution, CSV, auto datetime format detection, epoch, ingest from SSH, JSON, EMR
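As a sketch of the JSON ingestion support, COPY can load JSON objects directly from Amazon S3 with automatic field mapping. The bucket, table, and credential placeholders here are invented:

```sql
-- Hypothetical load of newline-delimited JSON; 'auto' maps JSON fields
-- to like-named columns in the target table.
COPY clicks
FROM 's3://my-bucket/clicks/2014-07-10.json'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
JSON 'auto'
GZIP;
```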

Page 27: Uses and Best Practices for Amazon Redshift

Amazon Redshift Feature Delivery

Service Launch (2/14)

PDX (4/2)

Temp Credentials (4/11)

DUB (4/25)

SOC1/2/3 (5/8)

Unload Encrypted Files

NRT (6/5)

JDBC Fetch Size (6/27)

Unload logs (7/5)

SHA1 Builtin (7/15)

4 byte UTF-8 (7/18)

Sharing snapshots (7/18)

Statement Timeout (7/22)

Timezone, Epoch, Autoformat (7/25)

WLM Timeout/Wildcards (8/1)

CRC32 Builtin, CSV, Restore Progress (8/9)

Resource Level IAM (8/9)

PCI (8/22)

UTF-8 Substitution (8/29)

JSON, Regex, Cursors (9/10)

Split_part, Audit tables (10/3)

SIN/SYD (10/8)

HSM Support (11/11)

Kinesis EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross Region Backup (11/13)

Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13)

EIP Support for VPC Clusters (12/28)

New query monitoring system tables and diststyle all (1/13)

Redshift on DW2 (SSD) Nodes (1/23)

Compression for COPY from SSH, fetch size support for single-node clusters, new system tables with commit stats, row_number(), strtol() and query termination (2/13)

Resize progress indicator & Cluster Version (3/21)

Regex_Substr, COPY from JSON (3/25)

50 slots, COPY from EMR, ECDHE ciphers (4/22)

3 new regex features, Unload to single file, FedRAMP (5/6)

Rename Cluster (6/2)

Copy from multiple regions, percentile_cont, percentile_disc (6/30)

Free Trial (7/1)

Page 28: Uses and Best Practices for Amazon Redshift

New Features

• UNLOAD to single file

• COPY from multiple regions

• percentile_cont & percentile_disc window functions

Page 29: Uses and Best Practices for Amazon Redshift

Try Amazon Redshift with BI & ETL for Free!

• http://aws.amazon.com/redshift/free-trial

• 2 months, 750 hours/month of free DW2.Large usage

• Also try BI & ETL for free from nine partners

Page 30: Uses and Best Practices for Amazon Redshift


A Year With Amazon Redshift

Why Upworthy Chose Redshift And How It’s Going

July 10, 2014

Page 31: Uses and Best Practices for Amazon Redshift

What’s Upworthy

• We’ve been called:
– “Social media with a mission” by our About Page
– “The fastest growing media site of all time” by Fast Company
– “The Fastest Rising Startup” by The Crunchies
– “That thing that’s all over my newsfeed” by my annoyed friends
– “The most data-driven media company in history” by me, optimistically

Page 32: Uses and Best Practices for Amazon Redshift

What We Do

• We aim to drive massive amounts of attention to things that really matter.

• We do that by finding, packaging, and distributing great, meaningful content.

Page 33: Uses and Best Practices for Amazon Redshift

Our Use Case

Page 34: Uses and Best Practices for Amazon Redshift

When We Started

• Building a data warehouse from scratch

• One engineer on the project

• Object data in MongoDB

• Had discovered MoSQL

• Knew which two we’d choose:
– Comprehensive
– Ad Hoc
– Real-Time

Page 35: Uses and Best Practices for Amazon Redshift

The Decision

• Downsides to self-hosting
– Cost

– Maintenance

• Tried Redshift by the hour

Page 36: Uses and Best Practices for Amazon Redshift

Building it out initially

• ~50 events/second

• Snowflake or denormalized? Both.

[Diagram: browser events flow through a web service and an S3 drain into raw event files in S3, then into Redshift; MongoDB (master) object data replicates via MoSQL into PostgreSQL and loads into Redshift; EMR produces processed objects that also load into Redshift]

Page 37: Uses and Best Practices for Amazon Redshift

Our system now

• Stats:
– ~5 TB of compressed data
– Two main tables = 13 billion rows
– Average: ~1,085 events/second
– Peak: ~2,500 events/second

• 5-10 minute ETL cycle (Kinesis session later)

• Lots of rollup tables

• COPY happens quickly (5-10s) every 1-2 minutes

Page 38: Uses and Best Practices for Amazon Redshift

What We’ve Learned

Page 39: Uses and Best Practices for Amazon Redshift

Initial Lessons

• Columnar can be disorienting:
– What’s in the SELECT matters. A lot.
– SELECT COUNT(*) is really, really fast.
– SELECT * is really, really slow.
– Wide, tall tables with lots of NULLs = fine.

• Sortkeys are hugely powerful.

• Bulk Operations (COPY, not INSERT) FTW
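The bulk-operations bullet above, as a minimal COPY sketch — bucket, table, and credential placeholders are invented:

```sql
-- Hypothetical bulk load: one COPY from S3 replaces thousands of
-- single-row INSERTs and runs in parallel across compute nodes.
COPY events
FROM 's3://my-bucket/events/batch-0001'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
DELIMITER '|'
GZIP;
```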

Page 40: Uses and Best Practices for Amazon Redshift

Staging to Master Pattern

INSERT INTO master
WITH to_insert AS (
    SELECT
        col1,
        col2,
        col3
    FROM
        staging
)
SELECT
    s.*
FROM
    to_insert s
    LEFT JOIN master m ON m.col1 = s.col1
WHERE
    m.col1 IS NULL;

Page 41: Uses and Best Practices for Amazon Redshift

Hash multiple join keys into one

Instead of joining on several columns:

...
FROM
    table1 t1
    LEFT JOIN table2 t2 ON
        t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3

precompute md5(col1 || col2 || col3) AS hashed_join_key in each table and join on that:

...
FROM
    table1 t1
    LEFT JOIN table2 t2 ON t1.hashed_join_key = t2.hashed_join_key

Page 42: Uses and Best Practices for Amazon Redshift

Pain Points

• Data loading can be a bit tricky at first
– Actually read the documentation. It’ll help.

• COUNT(DISTINCT col1) can be painful
– Use APPROXIMATE COUNT(DISTINCT col1) instead

• No null-safe operator. NULL = NULL returns NULL
– NVL(t1.col1, 0) = NVL(t2.col1, 0) is null-safe

• Error messages can be less than informative
– The AWS Redshift forum is fantastic. Someone else has probably seen that error before you.
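The null-safe workaround, sketched as a join. Table and column names are invented, and the 0 sentinel assumes 0 never occurs as a real col1 value:

```sql
-- Hypothetical: NULL = NULL evaluates to NULL, so rows with NULL col1
-- on both sides would never match without the NVL wrapper.
SELECT t1.id, t2.id
FROM table1 t1
JOIN table2 t2
  ON NVL(t1.col1, 0) = NVL(t2.col1, 0);
```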

Page 43: Uses and Best Practices for Amazon Redshift

Where We’re Going Next

Page 44: Uses and Best Practices for Amazon Redshift

In The Next Few Months

• More structure to our ETL (implementing Luigi)

• Need to avoid serializable isolation violations

• Dozens of people querying a cluster of 4 dw1.xlarge nodes can get ugly

• A “write” cluster of dw1 nodes and a “read” cluster of dw2 “dense compute” nodes

Page 45: Uses and Best Practices for Amazon Redshift

Thank You!