SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Tom Johnston, S3 Product Management, AWS
Tom Fuller, Senior Solutions Architect, AWS
John Elliott, Infrastructure Engineering, Pinterest
April 19, 2017
Deep Dive on Object Storage Amazon S3 and Amazon Glacier
Cloud Data Migration
• Direct Connect
• Snow* data transport family
• 3rd Party Connectors
• Transfer Acceleration
• Storage Gateway
• Amazon Kinesis Firehose
The AWS Storage Portfolio
• Object: Amazon S3, Amazon Glacier
• Block: Amazon EBS (persistent), Amazon EC2 Instance Store (ephemeral)
• File: Amazon EFS
What to Expect from the Session
• Pick the right storage class for your use cases
• Automate management tasks
• Best practices to optimize S3 performance
• Tools to help you manage storage
Data transfer into Amazon S3
• AWS Direct Connect
• AWS Snowball, AWS Snowball Edge, and AWS Snowmobile
• ISV Connectors
• Amazon Kinesis Firehose
• S3 Transfer Acceleration
• AWS Storage Gateway
Amazon Storage Partner Solutions: aws.amazon.com/backup-recovery/partner-solutions/
Note: Represents a sample of storage partners
• Backup and Recovery: solutions that leverage Amazon S3 for durable data backup
• Primary Storage: solutions that leverage file, block, object, and streamed data formats as an extension to on-premises storage
• Archive: solutions that leverage Amazon Glacier for durable and cost-effective long-term data backup
Choice of storage classes on S3
• Standard: active data
• Standard - Infrequent Access: infrequently accessed data
• Amazon Glacier: archive data
Storage classes designed for your use case
S3 Standard
• Big data analysis
• Content distribution
• Static website hosting
Standard - IA
• Backup & archive
• Disaster recovery
• File sync & share
• Long-retained data
Amazon Glacier
• Long-term archives
• Digital preservation
• Magnetic tape replacement
When should you move to Standard-IA?
S3 Analytics - storage class analysis
• Visualize the access pattern on your data over time
• Measure the object age where data is infrequently accessed
• Dive deep by bucket, prefix, or specific object tag
• Easily create a lifecycle policy based on the analysis
Visualize access pattern on your data
Export S3 Analytics to the tools of your choice
• Pick the right storage class for your use cases
• Automate management tasks
• Best practices to optimize S3 performance
• Tools to help you manage storage
Automate data management with lifecycle policies
• Automatic tiering and cost controls
• Includes two possible actions:
  • Transition: archives to Standard - IA or Amazon Glacier based on the object age you specify
  • Expiration: deletes objects after a specified time
• Actions can be combined
• Set policies by bucket, prefix, or tags
• Set policies for current or non-current versions
Set up a lifecycle policy on the AWS Management Console
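The two actions above can also be combined in a single rule. A minimal sketch in Python; the bucket name, prefix, and day counts are hypothetical, and the commented boto3 call shows how the configuration would be applied:

```python
# Sketch of a lifecycle configuration combining Transition and Expiration.
# Bucket name, prefix, and day counts are illustrative, not from the talk.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-then-expire-logs",
            "Filter": {"Prefix": "logs/"},  # set the policy by prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # tier after 30 days
                {"Days": 90, "StorageClass": "GLACIER"},      # archive after 90 days
            ],
            "Expiration": {"Days": 365},  # delete after one year
        }
    ]
}

# Applying it requires AWS credentials, so the call is left commented:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-bucket", LifecycleConfiguration=lifecycle_config)
```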
Protect your data from accidental deletes
• Protects from unintended user deletes or application logic failures
• New version with every upload
• Easy retrieval of deleted objects and roll back to previous versions
Best Practice: Versioning
Easily recover from unintended deletes. Tip: create a recycle bin for your storage.
Best Practice: Automate with trigger-based workflows
Amazon S3 event notifications
Events can be delivered to an SNS topic, an SQS queue, or a Lambda function
• Notification when objects are created (via PUT, POST, Copy, or Multipart Upload) or deleted
• Filter on prefixes and suffixes
• Trigger workflow with Amazon SNS, Amazon SQS, and AWS Lambda functions
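A notification configuration wiring these pieces together might look like the following sketch; the Lambda ARN, bucket, and filter values are hypothetical:

```python
# Sketch of an S3 event notification configuration: invoke a Lambda function
# when .jpg objects are created under images/. The ARN and names are hypothetical.
notification_config = {
    "LambdaFunctionConfigurations": [
        {
            "Id": "thumbnail-on-upload",
            "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:make-thumbnail",
            "Events": ["s3:ObjectCreated:*"],  # PUT, POST, Copy, Multipart Upload
            "Filter": {
                "Key": {
                    "FilterRules": [
                        {"Name": "prefix", "Value": "images/"},  # filter on prefix
                        {"Name": "suffix", "Value": ".jpg"},     # and on suffix
                    ]
                }
            },
        }
    ]
}

# import boto3
# boto3.client("s3").put_bucket_notification_configuration(
#     Bucket="my-bucket", NotificationConfiguration=notification_config)
```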
Cross-region replication
Automated, fast, and reliable asynchronous replication of data across AWS Regions
Use cases:
• Compliance: store data hundreds of miles apart
• Lower latency: distribute data to regional customers
• Security: create remote replicas managed by separate AWS accounts
How it works:
• Only replicates new PUTs; once configured, all new uploads into the source bucket are replicated
• Entire bucket or prefix based
• 1:1 replication between any 2 regions
• Versioning required
• Deletes and lifecycle actions are not replicated
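Under those constraints, a replication configuration is an IAM role plus one or more rules. A sketch with hypothetical ARNs and prefix:

```python
# Sketch of a cross-region replication configuration. Versioning must already
# be enabled on both buckets. Role ARN, bucket ARN, and prefix are hypothetical.
replication_config = {
    "Role": "arn:aws:iam::123456789012:role/s3-crr-role",  # role S3 assumes to replicate
    "Rules": [
        {
            "ID": "replicate-docs",
            "Prefix": "docs/",  # use "" to replicate the entire bucket
            "Status": "Enabled",
            "Destination": {"Bucket": "arn:aws:s3:::my-bucket-replica"},
        }
    ]
}

# import boto3
# boto3.client("s3").put_bucket_replication(
#     Bucket="my-bucket", ReplicationConfiguration=replication_config)
```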
Summary – automate management tasks
• Cross-region replication
• Automate transition and expiration with lifecycle policies
• Trigger-based workflow with event notifications
• Easily recover from accidental deletes with versioning
Topics
• Pick the right storage class for your use cases
• Automate management tasks
• Best practices to optimize S3 performance
• Tools to help you manage storage
Faster upload over long distances: S3 Transfer Acceleration
Uploader → AWS Edge Location → S3 Bucket: optimized throughput!
• Change your endpoint, not your code
• No firewall changes or client software
• Longer distance, larger files, more benefit
• Faster or free
• 68 global edge locations
• Try it at s3speedtest.com
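"Change your endpoint, not your code" in practice: the accelerated endpoint differs only in hostname. The bucket and key below are hypothetical:

```python
# The same bucket and key; only the hostname changes for Transfer Acceleration.
# Bucket name and key are illustrative.
bucket, key = "my-bucket", "videos/raw.mp4"

standard_url = f"https://{bucket}.s3.amazonaws.com/{key}"
accelerated_url = f"https://{bucket}.s3-accelerate.amazonaws.com/{key}"

# With boto3, acceleration is a client config switch instead of a URL edit:
# import boto3
# from botocore.config import Config
# s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
```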
Faster upload of large objects
Best Practice: parallelize PUTs with multipart uploads
• Increase aggregate throughput by parallelizing PUTs on high-bandwidth networks
• Move the bottleneck to the network, where it belongs
• Increase resiliency to network errors; fewer large restarts on error-prone networks
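The parallelization starts by cutting the object into part-sized byte ranges. A small sketch; the part size and object size are illustrative, and each range would then be sent with a separate upload_part call:

```python
# Sketch: split an object into byte ranges for parallel multipart upload.
# S3 parts must be at least 5 MiB, except the last one.
def part_ranges(total_size: int, part_size: int = 8 * 1024 * 1024):
    """Return (part_number, first_byte, last_byte) tuples covering the object."""
    ranges = []
    part_number = 1
    for start in range(0, total_size, part_size):
        end = min(start + part_size, total_size) - 1  # inclusive last byte
        ranges.append((part_number, start, end))
        part_number += 1
    return ranges

# A 20 MiB object with 8 MiB parts splits into three parts,
# each of which a worker would upload independently.
parts = part_ranges(20 * 1024 * 1024)
```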
Faster download: you can parallelize GETs as well as PUTs
GET /example-object HTTP/1.1
Host: example-bucket.s3.amazonaws.com
x-amz-date: Fri, 28 Jan 2016 21:32:02 GMT
Range: bytes=0-9
Authorization: AWS AKIAIOSFODNN7EXAMPLE:Yxg83MZaEgh3OZ3l0rLo5RTX11o=
For large objects, use range-based GETs and align your GET ranges with your multipart parts
For content distribution, enable Amazon CloudFront
• Caches objects at the edge
• Low-latency data transfer to end users
SQL Query on S3
Amazon Athena
• No loading of data
• Serverless
• Supports text, CSV, TSV, JSON, AVRO, and columnar formats such as Apache ORC and Apache Parquet
• Access via console or JDBC driver
• $5 per TB scanned from S3
Getting Started – Athena with console
Query your S3 data using SQL
Run time and data scanned
Higher TPS by distributing key names
Use a key-naming scheme with randomness at the beginning for high TPS
• Most important if you regularly exceed 100 TPS on a bucket
• Avoid starting with a date or monotonically increasing numbers
Don’t do this…
<my_bucket>/2013_11_13-164533125.jpg
<my_bucket>/2013_11_13-164533126.jpg
<my_bucket>/2013_11_13-164533127.jpg
<my_bucket>/2013_11_13-164533128.jpg
<my_bucket>/2013_11_12-164533129.jpg
<my_bucket>/2013_11_12-164533130.jpg
<my_bucket>/2013_11_12-164533131.jpg
<my_bucket>/2013_11_12-164533132.jpg
<my_bucket>/2013_11_11-164533133.jpg
Distributing key names
Add randomness to the beginning of the key name with a hash or reversed timestamp (ssmmhhddmmyy)
<my_bucket>/521335461-2013_11_13.jpg
<my_bucket>/465330151-2013_11_13.jpg
<my_bucket>/987331160-2013_11_13.jpg
<my_bucket>/465765461-2013_11_13.jpg
<my_bucket>/125631151-2013_11_13.jpg
<my_bucket>/934563160-2013_11_13.jpg
<my_bucket>/532132341-2013_11_13.jpg
<my_bucket>/565437681-2013_11_13.jpg
<my_bucket>/234567460-2013_11_13.jpg
<my_bucket>/456767561-2013_11_13.jpg
<my_bucket>/345565651-2013_11_13.jpg
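One way to add the leading randomness is a short hash of the key itself; this helper is an illustrative sketch, not from the talk (the reversed-timestamp variant works the same way):

```python
# Sketch: prepend a short, deterministic hash of the key to spread keys
# across S3 partitions. The helper name and prefix length are illustrative.
import hashlib

def randomized_key(original_key: str, hash_len: int = 9) -> str:
    """Return the key with a short hex-hash prefix for better key distribution."""
    prefix = hashlib.md5(original_key.encode()).hexdigest()[:hash_len]
    return f"{prefix}-{original_key}"

key = randomized_key("2013_11_13-164533125.jpg")
```

Because the prefix is derived from the key, the mapping is deterministic: the same object name always yields the same distributed key, so readers can recompute it without a lookup table.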
Best Practices - performance
Faster upload over long distances with S3 Transfer Acceleration
Faster upload for large objects with S3 multipart upload
Optimize GET performance with Range GET and CloudFront
SQL Query on S3 with Athena
Distribute key name for high TPS workload
Topics
• Pick the right storage class for your use cases
• Automate management tasks
• Best practices to optimize S3 performance
• Tools to help you manage storage
Organize your data with object tags
Manage data based on what it is, as opposed to where it's located
• Classify your data, up to 10 tags per object
• Tag your objects with key-value pairs
• Write policies once based on the type of data
• Put object with tag or add tag to existing objects
Tags can drive storage metrics & analytics, lifecycle policies, and access control
Manage access with object tags
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::EXAMPLE-BUCKET-NAME/*",
      "Condition": {"StringEquals": {"s3:ExistingObjectTag/Project": "X"}}
    }
  ]
}
User permission by tags
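The tags themselves are set at upload time or added to existing objects later. A sketch with hypothetical tag values; the commented boto3 calls show both paths:

```python
# Sketch: tagging an object at PUT time and adding tags to an existing object.
# The PutObject Tagging parameter is a URL-encoded query string of key=value
# pairs; put_object_tagging takes a structured TagSet instead. Tags are hypothetical.
from urllib.parse import urlencode

tags = {"Project": "X", "Classification": "confidential"}
tagging = urlencode(tags)  # "Project=X&Classification=confidential"

# import boto3
# s3 = boto3.client("s3")
# # Put object with tags:
# s3.put_object(Bucket="my-bucket", Key="report.csv", Body=b"...", Tagging=tagging)
# # Add tags to an existing object:
# s3.put_object_tagging(Bucket="my-bucket", Key="existing.csv",
#     Tagging={"TagSet": [{"Key": k, "Value": v} for k, v in tags.items()]})
```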
Audit and monitor access: AWS CloudTrail data events
Use cases:
• Perform security analysis
• Meet your IT auditing and compliance needs
• Take immediate action on activity
How it works:
• Capture S3 object-level requests
• Enable at the bucket level
• Logs delivered to your S3 bucket
• $0.10 per 100,000 data events
Monitor performance and operation: Amazon CloudWatch metrics for S3
• Generate metrics for data of your choice: entire bucket, prefixes, and tags; up to 1,000 groups per bucket
• 1-minute CloudWatch metrics
• Alert and alarm on metrics
• $0.30 per metric per month
CloudWatch Metrics for S3
Metric Name           Value
AllRequests           Count
PutRequests           Count
GetRequests           Count
ListRequests          Count
DeleteRequests        Count
HeadRequests          Count
PostRequests          Count
BytesDownloaded       MB
BytesUploaded         MB
4xxErrors             Count
5xxErrors             Count
FirstByteLatency      ms
TotalRequestLatency   ms
S3 Inventory
• Save time: daily or weekly delivery to an S3 bucket, CSV file output
• Use case: trigger business workflows and applications such as secondary index garbage collection, data auditing, and offline analytics
• More information about your objects than provided by the LIST API, such as replication status, multipart upload flag, and delete marker
• Simple pricing: $0.0025 per million objects listed
S3 Inventory
Eventually consistent rolling snapshot:
• New objects may not be listed
• Removed objects may still be included
Name Value Type Description
Bucket String Bucket name. UTF-8 encoded.
Key String Object key name. UTF-8 encoded.
Version Id String Version ID of the object
Is Latest boolean true if object is the latest version (current version) of a versioned object, otherwise false
Delete Marker boolean true if object is a delete marker of a versioned object, otherwise false
Size long Object size in bytes
Last Modified String Last modified timestamp. Format in ISO: YYYY-MM-DDTHH:mm:ss.SSSZ
ETag String eTag in HEX encoded format
StorageClass String Valid values: STANDARD, REDUCED_REDUNDANCY, GLACIER, STANDARD_IA. UTF-8 encoded.
Multipart Uploaded boolean true if object is uploaded by using multipart, otherwise false
Replication Status String Valid values: REPLICA, COMPLETED, PENDING, FAILED. UTF-8 encoded.
Validate before you act: use HEAD Object to confirm an object's current state
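Putting the last two slides together: a sketch that scans a few inventory rows (an illustrative subset of the schema columns above) and selects tiering candidates, which would then be confirmed with HEAD Object before acting:

```python
# Sketch: read S3 Inventory CSV rows (here a hypothetical subset of columns:
# Bucket, Key, Size, LastModified, StorageClass) and pick STANDARD objects over
# a size threshold as tiering candidates. Because the report is an eventually
# consistent rolling snapshot, validate each candidate with HEAD Object first.
import csv, io

inventory_csv = """\
my-bucket,logs/2017/a.gz,52428800,2017-01-02T00:00:00.000Z,STANDARD
my-bucket,logs/2017/b.gz,1024,2017-01-02T00:00:00.000Z,STANDARD
my-bucket,archive/c.gz,52428800,2016-01-02T00:00:00.000Z,GLACIER
"""

FIELDS = ["Bucket", "Key", "Size", "LastModified", "StorageClass"]
rows = [dict(zip(FIELDS, r)) for r in csv.reader(io.StringIO(inventory_csv))]

candidates = [r["Key"] for r in rows
              if r["StorageClass"] == "STANDARD"
              and int(r["Size"]) >= 5 * 1024 * 1024]  # threshold is illustrative

# Validate before you act (requires AWS credentials, so left commented):
# head = boto3.client("s3").head_object(Bucket="my-bucket", Key=candidates[0])
```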
John Elliott Pinterest Infrastructure
100+ billion pins categorized by people into more than 2.6 billion boards
80+ terabytes of new data...every day
Almost entirely log data...
Over 150 petabytes of data
S3 Growth
Storage Growth
● YTD: 60%
● Last 12 months: 86%
● Since Jan ‘14: 1,467%
S3 Data Structure
Level 1: Bucket/
Level 2: Application/
Level 3: Table Name/
Level 4: dt=2017-04-13/
Old Data Flow: 6 hr runtime
Inputs: S3 bucket listing and S3 API logs
Jobs: Inventory Job, Operations Job, Efficiency Job, and a Rollup Job producing the Efficiency Report
● Count object sizes and read API logs
● Join data sets to determine object access activity in order to make tiering decisions
New Data Flow: 20 min runtime
The Rollup Job now consumes the S3 Inventory Report and S3 Analytics directly
● S3 Inventory report allows full bucket inventory and operations data
● S3 Analytics provides much needed data on object age and access patterns
DEMO: Setting up Inventory Analysis for S3
1. Enable Inventory
2. Process Daily Files
3. Discover Interesting Prefixes
4. Storage Analytics
5. Lifecycle Policy
Summary – manage your storage
Classify storage and manage access with S3 object tags
Audit and monitor access with CloudTrail
Monitor operational performance and set alarm with S3 CloudWatch metrics
Use S3 Inventory to discover interesting prefixes to dive deeper on
Recap
• Pick the right storage class for your use cases
• Automate management tasks
• Best practices to optimize S3 performance
• Tools to help you manage storage
Thank you!