To Cloud or Not To Cloud?

Greg Lindahl, CTO

@glindahl – [email protected]

About Us

•  Web-scale search engine with our own crawl & index

•  Public launch, November 2010

•  $60 M raised
•  800 servers, 16 PB spinning rust, ½ PB flash disk

blekko.com

izik – tablet search

The wiring diagram

[Diagram components: Web, Crawler, Extractor, Ranker, Indexer, Lookup, Query Analyzer, Front End, Query, SERP, DIG, KB]

Hijacking a meetup topic

•  Original topic was “virtualization or not”
•  But really, virtualization is an implementation detail these days
   – cloud => virtual
   – virtual => public or private cloud (probably)

•  This talk: Public cloud vs. not
•  I’m trying to list a bunch of things that you should think about … your situation probably differs from mine

The question

•  It’s 2007, and your CEO asks you:

Should our new startup use this newfangled cloud computing stuff or not?

Why cloud at all?

•  Flexible
   – prototyping & development
   – testing at scale
   – scale up for high usage and back down later

•  Turns CapEx into OpEx
   – startups prefer paying over time
   – “money tomorrow is cheaper than money today”, if you’re successful

{btw, plenty of banks will loan against equipment.}

Cloud win examples

•  CommonCrawl.org has a web crawl dataset on EC2 – Map/Reduce job to read the whole thing is ~ $50

•  Fewer ops people is actually true

•  Your company can change direction

OK, so what’s bad?

•  Examine the curve of Amazon’s pricing over time and per volume

•  People think it’s a low-priced product, but it’s not.

•  It’s value priced.
•  Not enough competition, yet, to really drive Amazon’s margins down

•  This is good for Amazon, maybe not for you.

6 Reasons to not use Amazon

•  Economy of scale in your favor?
•  Your max::min ratio is not large enough
•  Cloud IOPs are expensive
•  Data is heavy if you use a lot of local disk
•  SSDs are overpriced
•  Ratio of disk capacity or bandwidth :: ssd :: memory :: compute may not be ideal for you

Economy of scale

•  “Amazon has 100s of thousands of servers, so they can run them cheaper than I can.”

•  But:
   – you pay retail, not wholesale price
   – there are diminishing returns with size

•  At some point, it’s cheaper to do it yourself
•  100 servers? 50 servers?

{ blekko had 700 at launch… }

Your max::min ratio is not big enough

•  Maybe you use 100x as many servers some days?
   – Cloud is for you!

•  How long do your usage spikes last?
•  Can you predict them far enough in advance?
•  How long does it take you to spin up a new node?

{blekko’s day::night is only 2x}
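To make the max::min argument concrete, here is a rough sketch comparing "own enough servers for the peak" with "rent exactly what each hour needs". The prices and demand profiles are made-up assumptions, not blekko's or Amazon's numbers, and the model ignores spin-up time and how well you can predict spikes.

```python
def daily_cost_owned(hourly_demand, owned_cost_per_server_hour):
    """Owned hardware is sized for the peak and paid for around the clock."""
    return max(hourly_demand) * len(hourly_demand) * owned_cost_per_server_hour


def daily_cost_cloud(hourly_demand, cloud_cost_per_server_hour):
    """Rented servers are paid for only in the hours they are needed."""
    return sum(hourly_demand) * cloud_cost_per_server_hour


# Hypothetical prices: the cloud server-hour costs 3x the owned one.
OWNED = 0.10
CLOUD = 0.30

# Day::night of 2x: 12 hours at 100 servers, 12 hours at 50.
steady = [100] * 12 + [50] * 12
# Max::min of 100x: one spike hour at 100 servers, the rest at 1.
spiky = [100] + [1] * 23

for name, demand in (("steady", steady), ("spiky", spiky)):
    owned = daily_cost_owned(demand, OWNED)
    cloud = daily_cost_cloud(demand, CLOUD)
    winner = "own" if owned < cloud else "cloud"
    print(f"{name}: owned ${owned:.2f}/day, cloud ${cloud:.2f}/day -> {winner}")
```

Under these assumed prices the 2x profile favors owning and the 100x profile favors the cloud; the crossover depends entirely on the price ratio you plug in.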

Cloud IOPs are expensive

•  I/O Operations are expensive to start with
   – “spinning rust” disks only seek so much

•  Networked storage has low bandwidth compared to 10 attached disks
   – 1 Gbyte/sec sustained – whoa!

•  Networked disks are more expensive than local
   – better failure behavior, whether I want it or not

Data is heavy if you use a lot of local disk

•  I mean: it takes a loooooong time to copy a few tbytes of data onto your local disk over the network
   – 1 gigabit: ½ tbyte/hour
   – 10 gigabit: 5 tbytes/hour
   – even filling your ½ tbyte SSD is kinda slow

•  Slow spin-up/down of nodes hurts your ability to flex up and down
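The copy-time numbers above are just link-speed arithmetic. A minimal sketch, assuming decimal units (1 gigabit = 10^9 bits, 1 tbyte = 10^12 bytes) and an optional `efficiency` factor for protocol overhead that is my own addition:

```python
def tbytes_per_hour(link_gbit_per_sec, efficiency=1.0):
    """Terabytes moved per hour over a link of the given speed."""
    bytes_per_sec = link_gbit_per_sec * 1e9 / 8 * efficiency
    return bytes_per_sec * 3600 / 1e12


def hours_to_copy(data_tbytes, link_gbit_per_sec, efficiency=1.0):
    """Hours needed to copy the given amount of data over the link."""
    return data_tbytes / tbytes_per_hour(link_gbit_per_sec, efficiency)


for gbit in (1, 10):
    print(f"{gbit:>2} gigabit: {tbytes_per_hour(gbit):.2f} tbytes/hour")

# Filling a 1/2 tbyte SSD over a 1 gigabit link at full line rate.
print(f"1/2 tbyte SSD over 1 gigabit: {hours_to_copy(0.5, 1):.2f} hours")
```

At full line rate this gives 0.45 and 4.50 tbytes/hour, which the slide rounds to ½ and 5; real transfers will be slower.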

SSDs are overpriced (by cloud providers)

•  SSDs are completely awesome for read-heavy analytics queries

•  SSDs wear out with writes
•  No cloud provider charges a fee for writes?
•  Instead, they assume all their customers are average

•  … and so they charge way too much to customers who are smart about not writing too much

{ blekko is great at not writing to our SSDs }

Ratios available might not fit your usage

•  Amazon tries pretty hard:
   – high memory, high-CPU, GPU, high I/O, high-storage
   – weirder ones are less flexible

•  It’s still easy to not fit into that set of cookie cutters

•  Not fitting == wasted money
   – idle resources that you’ve paid for
   – moves the break-even point to smaller node count

{ blekko crawler nodes: 10 local disks (capacity, bandwidth, seeks), 2 ssds, 96 gigs ram }
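One way to put a number on "not fitting == wasted money" is to ask how many instances of a fixed shape it takes to cover your needs, and how much of each paid-for resource then sits idle. The workload and instance shape below are invented for illustration; they are not real Amazon instance types or blekko's actual requirements.

```python
import math


def instances_needed(need, shape):
    """Instances required so that every resource in `need` is covered."""
    return max(math.ceil(need[r] / shape[r]) for r in need)


def idle_fraction(need, shape):
    """Fraction of each paid-for resource that goes unused."""
    n = instances_needed(need, shape)
    return {r: 1 - need[r] / (n * shape[r]) for r in need}


# Hypothetical workload: disk-heavy, modest RAM and CPU.
need = {"ram_gb": 96, "disk_tb": 20, "cores": 8}
# Hypothetical cookie-cutter instance shape.
shape = {"ram_gb": 64, "disk_tb": 2, "cores": 16}

print("instances:", instances_needed(need, shape))
for resource, idle in idle_fraction(need, shape).items():
    print(f"{resource}: {idle:.0%} idle")
```

Here disk drives the instance count, so most of the RAM and nearly all of the cores are paid for and unused. That idle share is what pushes the break-even point toward a smaller node count.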

So…

•  For us, it was easy to predict the right answer
•  Our SWAG for launch day was 600 servers
   – and our entire index in SSD
   – and we can’t scale down from that

•  Amazon wasn’t renting SSDs yet
•  If you’re going to run your own servers, you need to start early

How about you?

•  RT analytics is a complicated subject

•  Two main thrusts
   – Pre: pre-compute aggregate numbers, query those
   – Mem: stick a subset of your big data into ram or ssd, do complicated queries against that

{ blekko only does Pre }

Pre

•  Needs to be wired into your stream of data generation, e.g. your webserver

•  Summary data can be pretty small
•  Doesn’t really matter where you put it
•  Not much impact on the cloud/no-cloud decision

{ blekko pre-computes a lot of things using “combinators” in our home-grown NoSQL, optionally stuffing them into our SSD caching system }

[Diagram: Server 1 and Server 2 each run Process 1 and Process 2; the processes emit increments +4, +3, +4, +7, which combine into +7 and +11 and finally into a single +18 applied to Disk 1, Disk 2 and Disk 3.]

Combinators reduce the total work
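The idea in the diagram can be sketched as a two-stage merge of additive updates. This is only an illustration of the general technique, not blekko's actual NoSQL code; the `page_hits` key, the assignment of increments to servers, and the 3-replica count are my assumptions based on the numbers in the diagram.

```python
from collections import Counter


def combine_adds(updates):
    """Fold a stream of (key, delta) increments into one delta per key."""
    combined = Counter()
    for key, delta in updates:
        combined[key] += delta
    return combined


# Assumed layout: two processes per server, one increment each.
server_1 = [("page_hits", 4), ("page_hits", 3)]
server_2 = [("page_hits", 4), ("page_hits", 7)]

# Stage 1: each server combines the increments from its own processes.
per_server = [combine_adds(server_1), combine_adds(server_2)]

# Stage 2: the per-server results are combined again before touching disk.
final = combine_adds(item for c in per_server for item in c.items())

replicas = 3  # assumed: one copy per disk in the diagram
writes_without = (len(server_1) + len(server_2)) * replicas
writes_with = len(final) * replicas
print(dict(final), writes_without, writes_with)
```

With these numbers the four increments collapse to one +18, so the three disks see 3 writes instead of 12. This works because addition is associative and commutative, so the order and grouping of merges do not change the result.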

Mem

•  Even a decimated subset of your fresh data can involve a lot of write bandwidth
   – Sometimes referred to as “high velocity”

•  High BW probably needs to go nearby your big data store

•  Analytics probably isn’t going to influence the cloud/not-cloud decision

Discuss!

•  Discuss

•  For more about blekko’s setup:
   – 3 part blog series at highscalability.com
   – Please search [high scalability blekko] in your search engine of choice
   – [email protected], @glindahl
