Uses and Best Practices for Amazon Redshift
Transcript of Uses and Best Practices for Amazon Redshift
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Uses & Best Practices
for Amazon Redshift
Rahul Pathak, AWS (rahulpathak@)
Daniel Mintz, Upworthy (danielmintz@)
July 10, 2014
Fast, simple, petabyte-scale data warehousing for less than $1,000/TB/Year
Amazon Redshift
[Diagram: Collect (Amazon Kinesis, AWS Direct Connect) → Store (Amazon S3, Amazon DynamoDB, Amazon Glacier) → Analyze (Amazon Redshift, Amazon EMR, Amazon EC2)]
• Petabyte scale
• Massively parallel
• Relational data warehouse
• Fully managed; zero admin
Amazon Redshift: a lot faster, a lot cheaper, a whole lot simpler
Common Customer Use Cases
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to the business
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW & SW costs by an order of magnitude
Amazon Redshift Customers
[Customer logos grouped by segment: Traditional Enterprise DW, Companies with Big Data, SaaS Companies]
Growing Ecosystem
AWS Marketplace
• Find software to use with Amazon Redshift
• One-click deployments
• Flexible pricing options
http://aws.amazon.com/marketplace/redshift
Data Loading Options
• Parallel upload to Amazon S3
• AWS Direct Connect
• AWS Import/Export
• Amazon Kinesis
• Systems integrators
[Partner logos: Data Integration, Systems Integrators]
Amazon Redshift Architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via Amazon S3; load from Amazon DynamoDB or SSH
• Two hardware platforms, optimized for data processing
– DW1: HDD; scale from 2 TB to 1.6 PB
– DW2: SSD; scale from 160 GB to 256 TB
[Diagram: clients connect via JDBC/ODBC to the leader node; compute nodes are linked by a 10 GigE (HPC) network; ingestion, backup, and restore flow through Amazon S3]
Amazon Redshift Node Types
DW1 (HDD)
• Optimized for I/O-intensive workloads
• High disk density
• On demand at $0.85/hour
• As low as $1,000/TB/year
• Scale from 2 TB to 1.6 PB
– DW1.XL: 16 GB RAM, 2 cores, 3 spindles, 2 TB compressed storage
– DW1.8XL: 128 GB RAM, 16 cores, 24 spindles, 16 TB compressed storage, 2 GB/sec scan rate
DW2 (SSD)
• High performance at smaller storage size
• High compute and memory density
• On demand at $0.25/hour
• As low as $5,500/TB/year
• Scale from 160 GB to 256 TB
– DW2.L *New*: 16 GB RAM, 2 cores, 160 GB compressed SSD storage
– DW2.8XL *New*: 256 GB RAM, 32 cores, 2.56 TB compressed SSD storage
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage

With row storage you do unnecessary I/O: to get the total amount, you have to read everything. With column storage, you only read the data you need.

ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage

ANALYZE COMPRESSION reports a suggested encoding per column:
analyze compression listing;
Table | Column | Encoding
---------+----------------+----------
listing | listid | delta
listing | sellerid | delta32k
listing | eventid | delta32k
listing | dateid | bytedict
listing | numtickets | bytedict
listing | priceperticket | delta32k
listing | totalprice | mostly32
listing | listtime | raw
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage

• COPY compresses automatically
• You can analyze and override
• More performance, less cost
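Overrides can be applied at table creation; a minimal sketch, assuming a listing table like the one analyzed above (column types are assumptions):

```sql
-- Hypothetical DDL applying encodings suggested by ANALYZE COMPRESSION
CREATE TABLE listing (
    listid         INTEGER      ENCODE delta,
    sellerid       INTEGER      ENCODE delta32k,
    dateid         SMALLINT     ENCODE bytedict,
    numtickets     SMALLINT     ENCODE bytedict,
    priceperticket DECIMAL(8,2) ENCODE delta32k,
    listtime       TIMESTAMP    ENCODE raw
);
```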
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage

• Track the minimum and maximum value for each block
• Skip over blocks that don't contain relevant data

[Diagram: sorted blocks with per-block min/max ranges, e.g. 10–324, 375–623, 637–959]
Amazon Redshift dramatically reduces I/O
• Column storage
• Data compression
• Zone maps
• Direct-attached storage
• Use local storage for performance
• Maximize scan rates
• Automatic replication and continuous backup
• HDD & SSD platforms
Amazon Redshift parallelizes and distributes everything
• Query
• Load
• Backup/Restore
• Resize
• Load in parallel from Amazon S3, Amazon EMR, Amazon DynamoDB, or any SSH connection
• Data automatically distributed and sorted according to DDL
• Scales linearly with number of nodes
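A parallel load from Amazon S3 can be sketched as a single COPY (bucket, table, and credential values are placeholders):

```sql
-- COPY splits the load across slices when the S3 prefix
-- matches multiple files (e.g. events.00, events.01, ...)
COPY events
FROM 's3://my-bucket/2014/07/events'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
GZIP
DELIMITER '\t';
```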
Amazon Redshift parallelizes and distributes everything
• Query
• Load
• Backup/Restore
• Resize

• Backups to Amazon S3 are automatic, continuous, and incremental
• Configurable system snapshot retention period; take user snapshots on demand
• Cross-region backups for disaster recovery
• Streaming restores enable you to resume querying faster
Amazon Redshift parallelizes and distributes everything
• Query
• Load
• Backup/Restore
• Resize
• Resize while remaining online
• Provision a new cluster in the background
• Copy data in parallel from node to node
• Only charged for source cluster
Amazon Redshift parallelizes and distributes everything
• Query
• Load
• Backup/Restore
• Resize
• Automatic SQL endpoint switchover via DNS
• Decommission the source cluster
• Simple operation via Console or API
Amazon Redshift parallelizes and distributes everything
• Query
• Load
• Backup/Restore
• Resize
Amazon Redshift is priced to let you analyze all your data
• Number of nodes × cost per hour
• No charge for leader node
• No upfront costs
• Pay as you go

DW1 (HDD)          | Price per hour (DW1.XL single node) | Effective annual price per TB
On-Demand          | $0.850                              | $3,723
1-Year Reservation | $0.500                              | $2,190
3-Year Reservation | $0.228                              | $999

DW2 (SSD)          | Price per hour (DW2.L single node)  | Effective annual price per TB
On-Demand          | $0.250                              | $13,688
1-Year Reservation | $0.161                              | $8,794
3-Year Reservation | $0.100                              | $5,498

(For example, a 3-year reserved DW1.XL: $0.228/hour × 8,760 hours/year ≈ $1,997/year for 2 TB, or about $999/TB/year.)
Amazon Redshift is easy to use
• Provision in minutes
• Monitor query performance
• Point and click resize
• Built in security
• Automatic backups
Amazon Redshift has security built in
• SSL to secure data in transit
• Encryption to secure data at rest
– AES-256; hardware accelerated
– All blocks on disks and in Amazon S3 encrypted
– HSM support so you control the keys
• Audit logging & AWS CloudTrail integration
• Amazon VPC support
• SOC 1/2/3, PCI DSS Level 1, FedRAMP, and more
[Diagram: clients connect via JDBC/ODBC to the leader node in the customer VPC; compute nodes sit in an internal VPC on a 10 GigE (HPC) network; ingestion, backup, and restore go through Amazon S3]
Amazon Redshift continuously backs up your data and recovers from failures
• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
• Backups to Amazon S3 are continuous, automatic, and incremental
– Designed for eleven nines of durability
• Continuous monitoring and automated recovery from failures of drives and nodes
• Able to restore snapshots to any Availability Zone within a region
• Easily enable backups to a second region for disaster recovery
60+ new features since launch
• Regions – N. Virginia, Oregon, Dublin, Tokyo, Singapore, Sydney
• Certifications – PCI, SOC 1/2/3, FedRAMP, PCI-DSS Level 1, others
• Security – Load/unload encrypted files, resource-level IAM, temporary credentials, HSM, ECDHE for perfect forward secrecy
• Manageability – Snapshot sharing, backup/restore/resize progress indicators, Cross-region backups
• Query – Regex, Cursors, MD5, SHA1, Time zone, workload queue timeout, HLL, Concurrency to 50
• Ingestion – S3 Manifest, LZOP/LZO, JSON built-ins, UTF-8 4-byte, invalid character substitution, CSV, auto datetime format detection, epoch, ingest from SSH, JSON, EMR
Amazon Redshift Feature Delivery
Service Launch (2/14)
PDX (4/2)
Temp Credentials (4/11)
DUB (4/25)
SOC1/2/3 (5/8)
Unload Encrypted Files
NRT (6/5)
JDBC Fetch Size (6/27)
Unload logs (7/5)
SHA1 Builtin (7/15)
4 byte UTF-8 (7/18)
Sharing snapshots (7/18)
Statement Timeout (7/22)
Timezone, Epoch, Autoformat (7/25)
WLM Timeout/Wildcards (8/1)
CRC32 Builtin, CSV, Restore Progress (8/9)
Resource Level IAM (8/9)
PCI (8/22)
UTF-8 Substitution (8/29)
JSON, Regex, Cursors (9/10)
Split_part, Audit tables (10/3)
SIN/SYD (10/8)
HSM Support (11/11)
Kinesis, EMR/HDFS/SSH copy, Distributed Tables, Audit Logging/CloudTrail, Concurrency, Resize Perf., Approximate Count Distinct, SNS Alerts, Cross Region Backup (11/13)
Distributed Tables, Single Node Cursor Support, Maximum Connections to 500 (12/13)
EIP Support for VPC Clusters (12/28)
New query monitoring system tables and diststyle all (1/13)
Redshift on DW2 (SSD) Nodes (1/23)
Compression for COPY from SSH, fetch size support for single-node clusters, new system tables with commit stats, row_number(), strtol() and query termination (2/13)
Resize progress indicator & Cluster Version (3/21)
Regex_Substr, COPY from JSON (3/25)
50 slots, COPY from EMR, ECDHE ciphers (4/22)
3 new regex features, Unload to single file, FedRAMP (5/6)
Rename Cluster (6/2)
Copy from multiple regions, percentile_cont, percentile_disc (6/30)
Free Trial (7/1)
New Features
• UNLOAD to single file
• COPY from multiple regions
• percentile_cont & percentile_disc window functions
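The first two features above can be sketched as SQL (bucket, table, and credential values are placeholders):

```sql
-- UNLOAD to a single file instead of one file per slice
UNLOAD ('SELECT * FROM events WHERE dateid = 2000')
TO 's3://my-bucket/exports/events_'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
PARALLEL OFF;

-- COPY from an S3 bucket in a different region
COPY events
FROM 's3://my-eu-bucket/events'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
REGION 'eu-west-1';
```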
Try Amazon Redshift with BI & ETL for Free!
• http://aws.amazon.com/redshift/free-trial
• 2 months, 750 hours/month of free DW2.Large usage
• Also try BI & ETL for free from nine partners
A Year With Amazon Redshift
Why Upworthy Chose Redshift And How It’s Going
July 10, 2014
What’s Upworthy
• We’ve been called:
– “Social media with a mission” by our About page
– “The fastest growing media site of all time” by Fast Company
– “The Fastest Rising Startup” by The Crunchies
– “That thing that’s all over my newsfeed” by my annoyed friends
– “The most data-driven media company in history” by me, optimistically
What We Do
• We aim to drive massive amounts of attention to things that really matter.
• We do that by finding, packaging, and distributing great, meaningful content.
Our Use Case
When We Started
• Building a data warehouse from scratch
• One engineer on the project
• Object data in MongoDB
• Had discovered MoSQL
• Knew which two we’d choose:
– Comprehensive
– Ad hoc
– Real-time
The Decision
• Downsides to self-hosting:
– Cost
– Maintenance
• Tried Redshift by the hour
Building It Out Initially
• ~50 events/second
• Snowflake or denormalized? Both.
[Pipeline diagram: Browser → Web Service → S3 Drain → raw events in S3 → EMR → processed events → Amazon Redshift; MongoDB object data replicated to PostgreSQL via MoSQL, then loaded into Redshift]
Our system now
• Stats:
– ~5 TB of compressed data
– Two main tables = 13 billion rows
– Average: ~1,085 events/second
– Peak: ~2,500 events/second
• 5–10 minute ETL cycle (Kinesis session later)
• Lots of rollup tables
• COPY happens quickly (5–10s) every 1–2 minutes
What We’ve Learned
Initial Lessons
• Columnar can be disorienting:
– What’s in the SELECT matters. A lot.
– SELECT COUNT(*) is really, really fast.
– SELECT * is really, really slow.
– Wide, tall tables with lots of NULLs = Fine.
• Sortkeys are hugely powerful.
• Bulk Operations (COPY, not INSERT) FTW
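The sortkey lesson can be sketched as DDL (table, columns, and key choices are hypothetical):

```sql
-- Rows are stored sorted by event_time, so time-range scans
-- touch only the relevant blocks via zone maps; distributing
-- on user_id co-locates each user's rows on one slice.
CREATE TABLE events (
    event_id   BIGINT,
    user_id    BIGINT,
    event_type VARCHAR(64),
    event_time TIMESTAMP
)
DISTKEY (user_id)
SORTKEY (event_time);
```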
Staging to Master Pattern

INSERT INTO master
WITH to_insert AS (
    SELECT
        col1,
        col2,
        col3
    FROM
        staging
)
SELECT
    s.*
FROM
    to_insert s
    LEFT JOIN master m ON m.col1 = s.col1
WHERE
    m.col1 IS NULL;
Hash Multiple Join Keys Into One

Instead of:

FROM
    table1 t1
    LEFT JOIN table2 t2 ON
        t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3

precompute md5(col1 || col2 || col3) AS hashed_join_key in each table, then:

FROM
    table1 t1
    LEFT JOIN table2 t2 ON t1.hashed_join_key = t2.hashed_join_key
Pain Points
• Data loading can be a bit tricky at first
– Actually read the documentation. It’ll help.
• COUNT(DISTINCT col1) can be painful
– Use APPROXIMATE COUNT(DISTINCT col1) instead
• No null-safe operator; NULL = NULL returns NULL
– NVL(t1.col1, 0) = NVL(t2.col1, 0) is null-safe
• Error messages can be less than informative
– The AWS Redshift forum is fantastic. Someone else has probably seen that error before you.
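Two of the workarounds above, as hedged SQL sketches (table and column names are hypothetical):

```sql
-- HyperLogLog-based estimate: much cheaper than an exact distinct count
SELECT APPROXIMATE COUNT(DISTINCT user_id) FROM events;

-- NULL = NULL yields NULL, so NULL keys never match in a plain join;
-- coalescing both sides to a sentinel makes the comparison null-safe
SELECT t1.id, t2.id
FROM t1
LEFT JOIN t2 ON NVL(t1.col1, 0) = NVL(t2.col1, 0);
```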
Where We’re Going Next
In The Next Few Months
• More structure to our ETL (implementing luigi)
• Need to avoid Serializable Isolation Violations
• Dozens of people querying a cluster of 4 dw1.xlarge nodes can get ugly
• “Write” cluster of dw1 nodes and “read” cluster of dw2 “dense compute” nodes
Thank You!