SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier


Transcript of SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Page 1: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Tom Johnston, S3 Product Management, AWS

Tom Fuller, Senior Solutions Architect, AWS

John Elliott, Infrastructure Engineering, Pinterest

April 19, 2017

Deep Dive on Object Storage Amazon S3 and Amazon Glacier

Page 2: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

The AWS Storage Portfolio

Object
• Amazon S3
• Amazon Glacier

Block
• Amazon EBS (persistent)
• Amazon EC2 Instance Store (ephemeral)

File
• Amazon EFS

Cloud Data Migration
• Direct Connect
• Snow* data transport family
• 3rd Party Connectors
• Transfer Acceleration
• Storage Gateway
• Amazon Kinesis Firehose

Page 3: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

What to Expect from the Session

• Pick the right storage class for your use cases
• Automate management tasks
• Best practices to optimize S3 performance
• Tools to help you manage storage

Page 4: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Data transfer into Amazon S3

• AWS Direct Connect
• AWS Snowball
• AWS Snowball Edge
• AWS Snowmobile
• ISV Connectors
• Amazon Kinesis Firehose
• S3 Transfer Acceleration
• AWS Storage Gateway

Page 5: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Amazon Storage Partner Solutions

Primary Storage: Solutions that leverage file, block, object, and streamed data formats as an extension to on-premises storage

Backup and Recovery: Solutions that leverage Amazon S3 for durable data backup

Archive: Solutions that leverage Amazon Glacier for durable and cost-effective long-term data backup

Note: Represents a sample of storage partners

aws.amazon.com/backup-recovery/partner-solutions/

Page 6: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Choice of storage classes on S3

• Standard: Active data
• Standard - Infrequent Access: Infrequently accessed data
• Amazon Glacier: Archive data

Page 7: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Storage classes designed for your use case

S3 Standard
• Big data analysis
• Content distribution
• Static website hosting

Standard - IA
• Backup & archive
• Disaster recovery
• File sync & share
• Long-retained data

Amazon Glacier
• Long term archives
• Digital preservation
• Magnetic tape replacement

Page 8: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

When should you move to Standard-IA?

S3 Analytics - storage class analysis

• Visualize the access pattern on your data over time

• Measure the object age where data is infrequently accessed

• Dive deep by bucket, prefixes, or specific object tag

• Easily create a lifecycle policy based on the analysis

Page 9: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Visualize access pattern on your data

Page 10: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier
Page 11: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Export S3 Analytics to the tools of your choice

Page 12: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier
Page 13: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Topics

• Pick the right storage class for your use cases
• Automate management tasks
• Best practices to optimize S3 performance
• Tools to help you manage storage

Page 14: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Automate data management: Lifecycle policies

• Automatic tiering and cost controls
• Includes two possible actions:
  • Transition: moves objects to Standard - IA or Amazon Glacier based on the object age you specify
  • Expiration: deletes objects after a specified time
• Actions can be combined
• Set policies by bucket, prefix, or tags
• Set policies for current version or non-current versions
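A minimal sketch of what a combined transition-plus-expiration rule might look like as data. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument; the rule ID, the `logs/` prefix, and the day counts are illustrative assumptions, not values from the session.

```python
import json


def build_lifecycle_configuration(prefix, ia_after_days, glacier_after_days,
                                  expire_after_days, noncurrent_expire_days):
    """Build one rule that tiers objects down and then expires them."""
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("expected ia_after < glacier_after < expire_after")
    rule = {
        "ID": "tier-then-expire",          # hypothetical rule name
        "Filter": {"Prefix": prefix},      # policy scoped to a prefix
        "Status": "Enabled",
        # Transition actions: move current versions by object age.
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
        # Expiration action: delete current versions after this age.
        "Expiration": {"Days": expire_after_days},
        # Separate setting for non-current versions in a versioned bucket.
        "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_expire_days},
    }
    return {"Rules": [rule]}


# Assumed example values: 30 days to Standard - IA, 90 to Glacier, 365 to expire.
config = build_lifecycle_configuration("logs/", 30, 90, 365, 30)
print(json.dumps(config, indent=2))
```

The same structure can also be created from the console; building it in code mainly helps when one policy has to be applied to many buckets.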

Page 15: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Set up a lifecycle policy on the AWS Management Console

Page 16: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier
Page 17: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier
Page 18: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Protect your data from accidental deletes: Versioning (Best Practice)

• Protects from unintended user deletes or application logic failures
• New version with every upload
• Easy retrieval of deleted objects and roll back to previous versions

Page 19: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Easily recover from unintended delete (Best Practice)

Tip: Create a recycle bin for your storage

Page 20: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Automate with trigger-based workflow: Amazon S3 event notifications

Events are delivered to:
• SNS topic
• SQS queue
• Lambda function

• Notification when objects are created (PUT, POST, Copy, Multipart Upload) or deleted (DELETE)
• Filter on prefixes and suffixes
• Trigger workflow with Amazon SNS, Amazon SQS, and AWS Lambda functions
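As a rough illustration of the Lambda side of such a workflow, the sketch below reads the bucket, key, and event name out of an S3 notification payload. The `Records[].s3.bucket.name`, `Records[].s3.object.key`, and `eventName` fields are part of the S3 event format, and keys arrive URL-encoded. The sample event, the `images/` prefix, and the `.jpg` suffix are made up for the example; in practice the prefix and suffix filter would normally be set on the notification configuration itself rather than in code.

```python
from urllib.parse import unquote_plus


def extract_objects(event, prefix="", suffix=""):
    """Return (event_name, bucket, key) for each record matching the filter."""
    matches = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys are URL-encoded in the notification (space becomes '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(prefix) and key.endswith(suffix):
            matches.append((record["eventName"], bucket, key))
    return matches


def handler(event, context):
    """Lambda-style entry point; here it only prints what it would process."""
    for event_name, bucket, key in extract_objects(event, "images/", ".jpg"):
        print(f"{event_name}: s3://{bucket}/{key}")


# Hand-written sample payload (assumption), trimmed to the fields used above.
sample_event = {
    "Records": [
        {"eventName": "ObjectCreated:Put",
         "s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "images/my+photo.jpg"}}},
        {"eventName": "ObjectRemoved:Delete",
         "s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "logs/app.log"}}},
    ]
}

handler(sample_event, None)
```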

Page 21: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Cross-region replication

Automated, fast, and reliable asynchronous replication of data across AWS Regions

Use cases:
• Compliance - store data hundreds of miles apart
• Lower latency - distribute data to regional customers
• Security - create remote replicas managed by separate AWS accounts

How it works:
• Only replicates new PUTs. Once configured, all new uploads into source bucket will be replicated
• Entire bucket or prefix based
• 1:1 replication between any 2 regions
• Versioning required
• Deletes and lifecycle actions are not replicated
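To make the "versioning required" and "entire bucket or prefix based" points concrete, here is a small sketch of the two configuration documents involved. They follow the shapes boto3 accepts for `put_bucket_versioning` (`VersioningConfiguration`) and `put_bucket_replication` (`ReplicationConfiguration`); the role ARN, bucket name, and rule ID are placeholders, not values from the talk.

```python
def build_versioning_configuration():
    """Versioning must be enabled on both source and destination buckets."""
    return {"Status": "Enabled"}


def build_replication_configuration(role_arn, destination_bucket, prefix=""):
    """One rule replicating new uploads under `prefix` ('' = entire bucket)."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to replicate on your behalf
        "Rules": [
            {
                "ID": "replicate-new-puts",   # hypothetical rule name
                "Prefix": prefix,
                "Status": "Enabled",
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


# Placeholder identifiers (assumptions for illustration only).
replication = build_replication_configuration(
    "arn:aws:iam::123456789012:role/example-replication-role",
    "example-replica-bucket",
    prefix="documents/",
)
print(build_versioning_configuration())
print(replication)
```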

Page 22: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Summary – automate management tasks

• Automate transition and expiration with lifecycle policies
• Easily recover from accidental delete with versioning
• Trigger-based workflow with event notification
• Cross-region replication

Page 23: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Topics

• Pick the right storage class for your use cases
• Automate management tasks
• Best practices to optimize S3 performance
• Tools to help you manage storage

Page 24: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Faster upload over long distances: S3 Transfer Acceleration

Uploader → AWS Edge Location → S3 Bucket (Optimized Throughput!)

• Change your endpoint, not your code
• No firewall changes or client software
• Longer distance, larger files, more benefit
• Faster or free
• 68 global edge locations
• Try it at S3speedtest.com
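"Change your endpoint, not your code" can be read as swapping the hostname the client talks to. The sketch below derives both hostnames for a bucket. The `s3-accelerate.amazonaws.com` suffix is the documented Transfer Acceleration endpoint, and acceleration requires a DNS-compliant bucket name without periods; the function name and the bucket name are only illustrative.

```python
def s3_endpoints(bucket):
    """Return the standard and accelerated virtual-hosted endpoints."""
    if "." in bucket:
        # Transfer Acceleration does not support bucket names containing periods.
        raise ValueError("accelerated endpoints need a bucket name without '.'")
    return {
        "standard": f"{bucket}.s3.amazonaws.com",
        "accelerated": f"{bucket}.s3-accelerate.amazonaws.com",
    }


endpoints = s3_endpoints("example-bucket")
print(endpoints["standard"])
print(endpoints["accelerated"])
```

Acceleration also has to be enabled on the bucket before the accelerated hostname will accept requests.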

Page 25: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Faster upload of large objects: Parallelize PUTs with multipart uploads (Best Practice)

• Increase aggregate throughput by parallelizing PUTs on high-bandwidth networks
• Move the bottleneck to the network, where it belongs
• Increase resiliency to network errors; fewer large restarts on error-prone networks
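A sketch of the planning step behind a parallel multipart upload: split the object into numbered parts, then hand the parts to a pool of workers. The 5 MiB minimum part size (for all but the last part) and the 10,000-part limit are S3's documented constraints. `fake_upload_part` is a stand-in invented for this example so that it runs without network access; a real uploader would send each byte range with an UploadPart request instead.

```python
from concurrent.futures import ThreadPoolExecutor

MIN_PART_SIZE = 5 * 1024 * 1024   # S3 minimum for every part except the last
MAX_PARTS = 10_000                # S3 maximum number of parts per upload


def plan_parts(object_size, part_size):
    """Return a list of (part_number, offset, length) covering the object."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")
    parts = []
    offset = 0
    number = 1
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts; choose a larger part_size")
    return parts


def fake_upload_part(part):
    """Placeholder for a real UploadPart call; returns what it 'uploaded'."""
    number, offset, length = part
    return number, length


MIB = 1024 * 1024
parts = plan_parts(12 * MIB, 5 * MIB)

# Parts are independent, so they can be sent concurrently; a failed part
# can be retried on its own instead of restarting the whole object.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_upload_part, parts))

print(parts)
```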

Page 26: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Faster download: You can parallelize GETs as well as PUTs

For large objects, use range-based GETs; align your GET ranges with your parts

    GET /example-object HTTP/1.1
    Host: example-bucket.s3.amazonaws.com
    x-amz-date: Fri, 28 Jan 2016 21:32:02 GMT
    Range: bytes=0-9
    Authorization: AWS AKIAIOSFODNN7EXAMPLE:Yxg83MZaEgh3OZ3l0rLo5RTX11o=

For content distribution, enable Amazon CloudFront
• Caches objects at the edge
• Low latency data transfer to end user
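For the range-based GET approach, a small sketch of how the `Range` header values could be generated so that each request lines up with one part. HTTP byte ranges are inclusive at both ends, which is why the end offset is one less than the next start. The function name and the tiny sizes are illustrative only.

```python
def range_headers(object_size, part_size):
    """Return one 'bytes=start-end' Range value per part (inclusive ends)."""
    headers = []
    for start in range(0, object_size, part_size):
        end = min(start + part_size, object_size) - 1
        headers.append(f"bytes={start}-{end}")
    return headers


# A 25-byte object fetched in 10-byte ranges; the last range is shorter.
ranges = range_headers(25, 10)
print(ranges)  # ['bytes=0-9', 'bytes=10-19', 'bytes=20-24']
```

Each value can be sent as the `Range` header of its own GET, and the responses reassembled in order.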

Page 27: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

SQL Query on S3

Amazon Athena

• No loading of data

• Serverless

• Supports text, CSV, TSV, JSON, AVRO, and columnar formats such as Apache ORC and Apache Parquet

• Access via console or JDBC driver

• $5 per TB scanned from S3

Page 28: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Getting Started – Athena with console

Page 29: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Query your S3 data using SQL

Run time and data scanned

Page 30: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Higher TPS by distributing key names

Use a key-naming scheme with randomness at the beginning for high TPS

• Most important if you regularly exceed 100 TPS on a bucket
• Avoid starting with a date or monotonically increasing numbers

Don’t do this…

<my_bucket>/2013_11_13-164533125.jpg
<my_bucket>/2013_11_13-164533126.jpg
<my_bucket>/2013_11_13-164533127.jpg
<my_bucket>/2013_11_13-164533128.jpg
<my_bucket>/2013_11_12-164533129.jpg
<my_bucket>/2013_11_12-164533130.jpg
<my_bucket>/2013_11_12-164533131.jpg
<my_bucket>/2013_11_12-164533132.jpg
<my_bucket>/2013_11_11-164533133.jpg

Page 31: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Distributing key names

Add randomness to the beginning of the key name with a hash or reversed timestamp (ssmmhhddmmyy)

<my_bucket>/521335461-2013_11_13.jpg
<my_bucket>/465330151-2013_11_13.jpg
<my_bucket>/987331160-2013_11_13.jpg
<my_bucket>/465765461-2013_11_13.jpg
<my_bucket>/125631151-2013_11_13.jpg
<my_bucket>/934563160-2013_11_13.jpg
<my_bucket>/532132341-2013_11_13.jpg
<my_bucket>/565437681-2013_11_13.jpg
<my_bucket>/234567460-2013_11_13.jpg
<my_bucket>/456767561-2013_11_13.jpg
<my_bucket>/345565651-2013_11_13.jpg
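Both options can be sketched in a few lines. The first function prepends a short hash of the original key; the second builds the reversed timestamp in the ssmmhhddmmyy order given above. The choice of SHA-256, the four-character prefix length, and the `-` separator are assumptions made for this example; the slide does not say which hash or length to use.

```python
import hashlib
from datetime import datetime


def hashed_key(key, prefix_len=4):
    """Prefix the key with the first hex characters of its hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}-{key}"


def reversed_timestamp_key(timestamp, name):
    """Prefix the name with seconds first: ssmmhhddmmyy."""
    return f"{timestamp.strftime('%S%M%H%d%m%y')}-{name}"


original = "2013_11_13-164533125.jpg"
print(hashed_key(original))

# 13 Nov 2013, 16:45:33 becomes 33 45 16 13 11 13.
stamp = datetime(2013, 11, 13, 16, 45, 33)
print(reversed_timestamp_key(stamp, "photo.jpg"))  # 334516131113-photo.jpg
```

Because the hash is derived from the key itself, a reader that knows the original name can recompute the full key without a lookup table.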

Page 32: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Best Practices - performance

Faster upload over long distances with S3 Transfer Acceleration

Faster upload for large objects with S3 multipart upload

Optimize GET performance with Range GET and CloudFront

SQL Query on S3 with Athena

Distribute key name for high TPS workload

Page 33: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Topics

• Pick the right storage class for your use cases
• Automate management tasks
• Best practices to optimize S3 performance
• Tools to help you manage storage

Page 34: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Organize your data with object tags

Manage data based on what it is as opposed to where it's located

• Classify your data, up to 10 tags per object
• Tag your objects with key-value pairs
• Write policies once based on the type of data
• Put object with tag or add tag to existing objects

Use tags with:
• Access control
• Lifecycle policy
• Storage metrics & analytics
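A short sketch of the two tag representations involved. When tags are supplied with the upload, they travel as a URL-encoded string (the `x-amz-tagging` header, or the `Tagging` argument of boto3's `put_object`). When tags are added to an existing object, the API takes a tag set of key-value pairs (as in boto3's `put_object_tagging`). The tag names and values below are invented for illustration.

```python
from urllib.parse import urlencode

MAX_TAGS_PER_OBJECT = 10  # limit stated on the slide


def check_tag_count(tags):
    if len(tags) > MAX_TAGS_PER_OBJECT:
        raise ValueError("an object can carry at most 10 tags")


def tagging_string(tags):
    """URL-encoded form used when putting an object together with its tags."""
    check_tag_count(tags)
    return urlencode(tags)


def tag_set(tags):
    """Key-value list form used when tagging an existing object."""
    check_tag_count(tags)
    return {"TagSet": [{"Key": k, "Value": v} for k, v in tags.items()]}


tags = {"Project": "X", "Classification": "internal"}  # hypothetical tags
print(tagging_string(tags))  # Project=X&Classification=internal
print(tag_set(tags))
```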

Page 35: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Manage access with object tags

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::EXAMPLE-BUCKET-NAME/*",
      "Condition": {"StringEquals": {"s3:ExistingObjectTag/Project": "X"}}
    }
  ]
}

User permission by tags

Page 36: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Audit and monitor access: AWS CloudTrail data events

Use cases:
• Perform security analysis
• Meet your IT auditing and compliance needs
• Take immediate action on activity

How it works:
• Capture S3 object-level requests
• Enable at the bucket level
• Logs delivered to your S3 bucket
• $0.10 per 100,000 data events

Page 37: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Monitor performance and operation: Amazon CloudWatch metrics for S3

• Generate metrics for data of your choice
  • Entire bucket, prefixes, and tags
  • Up to 1,000 groups per bucket
• 1-minute CloudWatch metrics
• Alert and alarm on metrics
• $0.30 per metric per month

Page 38: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier
Page 39: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

CloudWatch Metrics for S3

Metric Name            Value
AllRequests            Count
PutRequests            Count
GetRequests            Count
ListRequests           Count
DeleteRequests         Count
HeadRequests           Count
PostRequests           Count
BytesDownloaded        MB
BytesUploaded          MB
4xxErrors              Count
5xxErrors              Count
FirstByteLatency       ms
TotalRequestLatency    ms

Page 40: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Example

Page 41: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

S3 Inventory

Save time
• Daily or weekly delivery
• Delivery to S3 bucket
• CSV File Output

Use case: trigger business workflows and applications such as secondary index garbage collection, data auditing, and offline analytics

• More information about your objects than provided by LIST API, such as replication status, multipart upload flag and delete marker
• Simple pricing: $0.0025 per million objects listed

Page 42: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

S3 Inventory

Eventually consistent rolling snapshot
• New objects may not be listed
• Removed objects may still be included

Name | Value Type | Description
Bucket | String | Bucket name. UTF-8 encoded.
Key | String | Object key name. UTF-8 encoded.
Version Id | String | Version ID of the object
Is Latest | boolean | true if object is the latest version (current version) of a versioned object, otherwise false
Delete Marker | boolean | true if object is a delete marker of a versioned object, otherwise false
Size | long | Object size in bytes
Last Modified | String | Last modified timestamp. Format in ISO: YYYY-MM-DDTHH:mm:ss.SSSZ
ETag | String | eTag in HEX encoded format
StorageClass | String | Valid values: STANDARD, REDUCED_REDUNDANCY, GLACIER, STANDARD_IA. UTF-8 encoded.
Multipart Uploaded | boolean | true if object is uploaded by using multipart, otherwise false
Replication Status | String | Valid values: REPLICA, COMPLETED, PENDING, FAILED. UTF-8 encoded.

Validate before you act!
• Use HEAD OBJECT
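A minimal sketch of reading an inventory CSV into typed rows. It assumes the file has no header row and that the columns appear in the order of the field list above; a real report should be read according to the schema in its accompanying manifest. The internal column names and the sample rows are invented for illustration.

```python
import csv
import io

# Assumed column order, mirroring the field list above.
INVENTORY_COLUMNS = [
    "Bucket", "Key", "VersionId", "IsLatest", "IsDeleteMarker", "Size",
    "LastModified", "ETag", "StorageClass", "IsMultipartUploaded",
    "ReplicationStatus",
]
BOOLEAN_COLUMNS = ("IsLatest", "IsDeleteMarker", "IsMultipartUploaded")


def parse_inventory(csv_text):
    """Turn inventory CSV text into a list of dicts with typed values."""
    rows = []
    for values in csv.reader(io.StringIO(csv_text)):
        row = dict(zip(INVENTORY_COLUMNS, values))
        for name in BOOLEAN_COLUMNS:
            row[name] = row[name] == "true"
        # Delete markers have no size; treat an empty field as 0 bytes.
        row["Size"] = int(row["Size"]) if row["Size"] else 0
        rows.append(row)
    return rows


def live_objects(rows):
    """Current versions that are not delete markers."""
    return [r for r in rows if r["IsLatest"] and not r["IsDeleteMarker"]]


# Hand-written sample rows (assumption), one per line.
sample = (
    '"example-bucket","logs/a.log","v1","true","false","1024",'
    '"2017-04-13T00:00:00.000Z","abc123","STANDARD","false",""\n'
    '"example-bucket","logs/b.log","v2","true","true","",'
    '"2017-04-13T00:00:00.000Z","","","",""\n'
    '"example-bucket","logs/a.log","v0","false","false","512",'
    '"2017-04-12T00:00:00.000Z","def456","STANDARD_IA","false",""\n'
)

rows = parse_inventory(sample)
for row in live_objects(rows):
    print(row["Key"], row["Size"], row["StorageClass"])
```

Since the report is an eventually consistent snapshot, anything destructive driven by these rows should first confirm the object's current state with a HEAD Object request.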

Page 43: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

John Elliott Pinterest Infrastructure


Page 44: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

100+ billion pins categorized by people into more than 2.6 billion boards

Page 45: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

80+ terabytes of new data...every day

Almost entirely log data...

Over 150 petabytes of data


Page 46: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier
Page 47: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

S3 Growth

Storage Growth
• YTD: 60%
• 12 Months: 86%
• Since Jan ‘14: 1,467%

Page 48: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

S3 Data Structure

Level 1: Bucket/
Level 2: Application/
Level 3: Table Name/
Level 4: dt=2017-04-13/

Page 49: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Old Data Flow: 6hr runtime

Inputs: S3 bucket listing, S3 API logs
Jobs: Inventory Job, Operations Job, Efficiency Job, Rollup Job
Output: Efficiency Report

● Count object sizes and read API log
● Join data sets to determine object access activity in order to make tiering decisions

Page 50: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

New Data Flow: 20 min runtime

Inputs: S3 Inventory Report, S3 Analytics
Jobs: Rollup Job

● S3 Inventory report allows full bucket inventory and operations data
● S3 Analytics provides much needed data on object age and access patterns

Page 51: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Setting up Inventory Analysis for S3 DEMO

Enable Inventory

Process Daily Files

Discover Interesting Prefixes

Storage Analytics

Lifecycle Policy
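One way the "discover interesting prefixes" step could work is to roll object counts and bytes up to a fixed prefix depth and then rank the prefixes by size. This is a sketch under assumptions: it is not the code shown in the demo, the keys and sizes are invented, and the keys simply follow an Application/Table Name/dt=.../ layout.

```python
from collections import defaultdict


def rollup_by_prefix(objects, depth):
    """Sum object count and bytes per prefix, truncated to `depth` levels."""
    totals = defaultdict(lambda: {"objects": 0, "bytes": 0})
    for key, size in objects:
        directories = key.split("/")[:-1]      # drop the object name itself
        kept = directories[:depth]
        prefix = "/".join(kept) + "/" if kept else ""
        totals[prefix]["objects"] += 1
        totals[prefix]["bytes"] += size
    return dict(totals)


def largest_prefixes(totals, n):
    """Return the n prefixes holding the most bytes, largest first."""
    ranked = sorted(totals.items(), key=lambda item: item[1]["bytes"], reverse=True)
    return ranked[:n]


# (key, size in bytes) pairs as they might come out of an inventory report.
objects = [
    ("app/events/dt=2017-04-13/part-0", 100),
    ("app/events/dt=2017-04-12/part-0", 50),
    ("app/users/dt=2017-04-13/part-0", 10),
]

totals = rollup_by_prefix(objects, depth=2)
for prefix, stats in largest_prefixes(totals, 2):
    print(prefix, stats)
```

The prefixes that come out on top are natural candidates for storage class analysis and, from there, a lifecycle policy.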

Page 52: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Summary – manage your storage

Classify storage and manage access with S3 object tags

Audit and monitor access with CloudTrail

Monitor operational performance and set alarm with S3 CloudWatch metrics

Use Inventory and discover interesting prefixes to dive deeper on

Page 53: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Recap

• Pick the right storage class for your use cases
• Automate management tasks
• Best practices to optimize S3 performance
• Tools to help you manage storage

Page 54: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Thank you!

Pages 55-57: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Enable Inventory

Pages 58-66: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Process Daily Files

Pages 67-70: SRV403 Deep Dive on Object Storage: Amazon S3 and Amazon Glacier

Discover Interesting Prefixes