S3 & Glacier - The only backup solution you'll ever need

S3 & Glacier

Amazon S3 & Glacier - the only backup solution you’ll ever need

Matthew Boeckman, VP - DevOps, Craftsy@matthewboeckman

http://enginerds.craftsy.com http://devops.com/author/matthewboeckman

http://enginerds.craftsy.com/

http://devops.com/author/matthewboeckman

Who is Craftsy

● Instructor led training videos for passionate hobbyists● #19 on Forbes’ Most Promising Companies 2014

Why do backups matter?

… they don’t, until they do.

At small scale, backup can and is as easy as icloud, or google drive, or github.

As you grow, the situation gets more complex.

In a content creation business, backups are vital.

RESTORE

Define your exposure

daily backups - protect against accidental deletion (users) and system corruption/crash

DR - your building catches on fire. an employee goes rogue. the SAN suffers a total existence failure.

permanent archive - data at rest, legal/business permanence

what are my options

LTO - hope you like lifecycle hardware refreshes! hope your data rests on a 1.5 or 3tb boundry. hope you enjoy swapping tapes… 4-17year durability

~$.008/GB** does not include operational costs **

DISK - good luck offsiting, hope you enjoy hardware maintenance, hope you’ve got a big checkbook

~$5/GB** does not include operational costs **

hobo as a service

● not scalable● expensive● not addressable in code● requires software

systems

welcome to system adminsitrationville, population: you

why did we look for alternatives?

What are the alternatives?

s3

why s3?

● $0.03/GB● 99.999999999% durable● easy interface with glacier● addressable with code

“If you store 10,000 objects with us, on average we may lose one of them every 10 million years or so. This storage is designed in such a way that we can sustain the concurrent loss of data in two separate storage facilities.”

except...

this is a busy space

free tools, one thread, small files

http://s3tools.org/s3cmd ● great sync support● supports multipart● Does not support multithread● tons of other features (du, cp, rm, setpolicy, cloudfront…)

https://github.com/boto ● the aws python library● includes basic cli s3put, great reference point

great, but not fast

http://s3tools.org/s3cmd

https://github.com/boto

free tools, multipass, big files

https://github.com/mumrah/s3-multipart ● zomg multithreaded, multipart up and down● saturated 100M pipe for a single file transfer● does not handle small files

https://github.com/aws/aws-cli ● zomg multithreaded, multipart up and down● saturated 500M pipe for a single file transfer● perfect in every imaginable way

https://github.com/mumrah/s3-multipart

https://github.com/aws/aws-cli

problems

approach single archive (tar)

lots o files, some big some small

le good ● fast transfer● simple● no filename

pain

● restore & navigation granularity

le bad ● restore all● can’t nav your

backup

● processing overhead

● users name things...

our first backup script ... testy.sh

*shitty speed, restore costs, let’s just get to the next slide

to mp or not to mp...

size=$(stat -f %z "$x")

if [ "$size" -gt "100000000" ]; then

s3-mp-upload.py ${S3MPT_OW} -np 10 -s 100 "$x" "s3://$

{BUCKET}/$CID/${x}" >> $LOG 2>&1

else

s3put ${S3PUT_OW} -s $AWS_ACCESS_SECRET -a

$AWS_ACCESS_KEY -b $BUCKET -p "$CPATH" "$x" >> $LOG

2>&1

fi

aws s3 cli

aws s3 sync "$CID" "s3://${BUCKET}/${CID}" --delete

aws s3 cp "$SRC" "s3://${BUCKET}/${CID}" --recursive

simple, multipart, multithreaded and saturates 500mbps both ways

hooray! now what

why glacier?

● $0.01/GB● durable like s3● data sent to glacier via s3 policies cannot be

accidentally deleted

s3->glacier

prefix, delete

restore from glacier

restore, our UI

restore from glacier as code

foreach ($objects as $flerbs) { $key = $flerbs->Key; $restore = $s3->restore_archived_object ($bucket, $key, $days ); //var_dump($restore); $http_code = $restore->status; $xml_resp = $restore->body; $message = $xml_resp->Message; $header = $restore->header; $req_id = $header['x-amz-request-id']; $amz_id2 = $header['x-amz-id-2']; $req_date = $header['date'];}

oooh, a progress gui

restore status as code

$header = $s3->get_object_headers($bucket, $key)->header; $restore = explode(',', $header['x-amz-restore']);$progress = strpos($restore[0], 'true');…if ($sclass == 'GLACIER') { if ($header['x-amz-restore'] == '') { $style = '"background-color: #0ce3ff;"'; // not restored! $note = ''; } elseif ($progress != false) { $style = '"background-color: #ffa391;"'; // restore in progress! $note = 'Restoration in progress'; } else { $style = '"background-color: #95fed0;"'; // restored! $note = "Restored until $expirydate"; }

versioning, another approach

● versioning operates on the bucket level● each version of an object counts as storage ● versioning happens automatically with any PUT● deletion of an object does not explicitly delete versions (!!)

GET /my-image.jpg?versionId=L4kqtJlcpXroDTDmpUMLUo

GET /my-image.jpg?versions HTTP/1.1

DELETE /my-image.jpg?versionId=UIORUnfnd89493jJFJ

restoring a version

Restore to the most recent version?

1. delete the current version!

Restore to an older version?

2. GET the desired version

3. PUT it back on top of the object you’re restoring

where does Craftsy live?

1. Bi-weeklies of ACTIVE content to a non-versioned, non-glaciered bucket (~200TB total, +2TB/wk)

2. Weekly new finished content to glacier-backed s3 bucket (~500TB, +45TB/mo)

3. All code artifacts (dev, staging, prod) to a versioned s3 bucket

4. Postgres backup segments to an non-glaciered bucket

most important, we do all of it in code. we don’t shuffle tapes, upgrade hardware, or system administrate!

Thank you

QUESTIONS!

Matthew Boeckman

@matthewboeckman

http://enginerds.craftsy.com

(deck will be there & slideshare)http://devops.com/author/matthewboeckman

http://enginerds.craftsy.com/

http://devops.com/author/matthewboeckman

S3 & Glacier - The only backup solution you'll ever need

Technology

Transcript of S3 & Glacier - The only backup solution you'll ever need