To Cloud or Not To Cloud?
greg-lindahl -
About Us
• Web-scale search engine with our own crawl & index
• Public launch, November 2010
• $60 M raised
• 800 servers, 16 PB spinning rust, ½ PB flash disk
blekko.com
izik – tablet search
The wiring diagram
[Diagram; component labels: Web, Crawler, Extractor, Ranker, Indexer, Lookup, Query, Analyzer, Front End, Query, SERP, DIG, KB]
Hijacking a meetup topic
• Original topic was “virtualization or not”
• But really, virtualization is an implementation detail these days
  – cloud => virtual
  – virtual => public or private cloud (probably)
• This talk: Public cloud vs. not
• I’m trying to list a bunch of things that you should think about … your situation probably differs from mine
The question
• It’s 2007, and your CEO asks you:
Should our new startup use this newfangled cloud computing stuff or not?
Why cloud at all?
• Flexible
  – prototyping & development
  – testing at scale
  – scale up for high usage and back down later
• Turns CapEx into OpEx
  – startups prefer paying over time
  – “money tomorrow is cheaper than money today”, if you’re successful
{btw, plenty of banks will loan against equipment.}
Cloud win examples
• CommonCrawl.org has a web crawl dataset on EC2
  – Map/Reduce job to read the whole thing is ~ $50
• Fewer ops people is actually true
• Your company can change direction
OK, so what’s bad?
• Examine the curve of Amazon’s pricing over time and per volume
• People think it’s a low-priced product, but it’s not.
• It’s value priced.
• Not enough competition, yet, to really drive Amazon’s margins down
• This is good for Amazon, maybe not for you.
6 Reasons to not use Amazon
• Economy of scale in your favor?
• Your max::min ratio is not large enough
• Cloud IOPs are expensive
• Data is heavy if you use a lot of local disk
• SSDs are overpriced
• Ratio of disk capacity or bandwidth :: ssd :: memory :: compute may not be ideal for you
Economy of scale
• “Amazon has 100s of thousands of servers, so they can run them cheaper than I can.”
• But:
  – you pay retail, not wholesale price
  – there are diminishing returns with size
• At some point, it’s cheaper to do it yourself
• 100 servers? 50 servers?
{ blekko had 700 at launch… }
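The "at some point, it's cheaper to do it yourself" claim can be sketched as a simple break-even model. This is only an illustration: the linear cost model and every dollar figure below are hypothetical, not numbers from the talk.

```python
import math

def break_even_servers(fixed_monthly: float,
                       own_per_server: float,
                       cloud_per_server: float) -> float:
    """Server count above which owning is cheaper than renting.

    Assumed (hypothetical) cost model, per month:
      owning:  fixed_monthly + own_per_server * n   (ops staff, colo, then hardware)
      renting: cloud_per_server * n                 (retail price per node)
    """
    margin = cloud_per_server - own_per_server
    if margin <= 0:
        return math.inf  # renting is never more expensive per server
    return fixed_monthly / margin

# Hypothetical monthly figures, chosen only to show the shape of the trade-off.
n = break_even_servers(fixed_monthly=20_000, own_per_server=150, cloud_per_server=400)
print(f"owning wins above about {n:.0f} servers")
```

Plugging in your own fixed overhead and per-node prices is the point; the break-even count moves a lot with the retail margin.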
Your max::min ratio is not big enough
• Maybe you use 100x as many servers some days?
  – Cloud is for you!
• How long do your usage spikes last?
• Can you predict them far enough in advance?
• How long does it take you to spin up a new node?
{blekko’s day::night is only 2x}
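One way to put a number on the max::min question: owned hardware is paid for at peak capacity all day, while rented capacity can (ideally) follow demand. The sketch below assumes perfect hour-by-hour elasticity and an even 12h/12h split for the 2x day::night case; both are simplifying assumptions, and the spiky profile is made up.

```python
def utilization(hourly_demand):
    """Average demand divided by peak demand (1.0 means perfectly flat load)."""
    return sum(hourly_demand) / (len(hourly_demand) * max(hourly_demand))

def max_cloud_premium(hourly_demand):
    """Highest cloud/owned price ratio per server-hour at which renting still wins.

    Owned hardware is sized for the peak and paid for around the clock;
    rented capacity is assumed to track demand hour by hour.
    """
    return 1.0 / utilization(hourly_demand)

# 2x day::night swing: assumed 12 hours at 2 units, 12 hours at 1 unit.
day_night = [2] * 12 + [1] * 12
# Hypothetical spiky workload: 100x for one hour a day.
spiky = [100] + [1] * 23

print(round(max_cloud_premium(day_night), 2))  # 1.33
print(round(max_cloud_premium(spiky), 2))      # 19.51
```

With only a 2x swing, the cloud has to cost less than about 1.33x your own per-server-hour cost to win; with a 100x spike it can cost far more and still come out ahead.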
Cloud IOPs are expensive
• I/O Operations are expensive to start with
  – “spinning rust” disks only seek so much
• Networked storage has low bandwidth compared to 10 attached disks
  – 1 Gbyte/sec sustained – woah!
• Networked disks are more expensive than local
  – better failure behavior, whether I want it or not
Data is heavy if you use a lot of local disk
• I mean: it takes a loooooong time to copy a few tbytes of data onto your local disk over the network
  – 1 gigabit: ½ tbyte/hour
  – 10 gigabit: 5 tbytes/hour
  – even filling your ½ tbyte SSD is kinda slow
• Slow spin-up/down of nodes hurts your ability to flex up and down
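The copy-time figures above are easy to check. A small sketch, assuming decimal units (1 tbyte = 10^12 bytes) and a link running at full line rate; the `efficiency` parameter is an added knob for protocol overhead, not something from the talk.

```python
def tbytes_per_hour(link_gbits_per_sec: float, efficiency: float = 1.0) -> float:
    """Terabytes moved per hour over a link (decimal units, 1 TB = 10**12 bytes)."""
    bytes_per_sec = link_gbits_per_sec * 1e9 / 8 * efficiency
    return bytes_per_sec * 3600 / 1e12

def hours_to_fill(capacity_tbytes: float, link_gbits_per_sec: float,
                  efficiency: float = 1.0) -> float:
    """Hours needed to copy capacity_tbytes over the link."""
    return capacity_tbytes / tbytes_per_hour(link_gbits_per_sec, efficiency)

for gbits in (1, 10):
    print(f"{gbits} gigabit: {tbytes_per_hour(gbits):.2f} tbytes/hour")

# Filling a 1/2 tbyte SSD over a 1 gigabit link:
print(f"{hours_to_fill(0.5, 1):.2f} hours")
```

At line rate this gives 0.45 and 4.5 tbytes/hour, which the slide rounds to ½ and 5, and a bit over an hour to fill the SSD.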
SSDs are overpriced (by cloud providers)
• SSDs are completely awesome for read-heavy analytics queries
• SSDs wear out with writes
• No cloud provider charges a fee for writes?
• Instead, they assume all their customers are average
• … and so they charge way too much to customers who are smart about not writing too much
{ blekko is great at not writing to our SSDs }
Ratios available might not fit your usage
• Amazon tries pretty hard:
  – high memory, high-CPU, GPU, high I/O, high-storage
  – weirder ones are less flexible
• It’s still easy to not fit into that set of cookie cutters
• Not fitting == wasted money
  – idle resources that you’ve paid for
  – moves the break-even point to smaller node count
{ blekko crawler nodes: 10 local disks (capacity, bandwidth, seeks), 2 ssds, 96 gigs ram }
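"Not fitting == wasted money" can be made concrete: the scarcest resource decides how many nodes you rent, and everything else on those nodes sits idle. A rough sketch; the workload and the instance shape below are hypothetical and do not correspond to any real Amazon offering.

```python
import math

def nodes_needed(need: dict, shape: dict) -> int:
    """Nodes of one shape required so that every resource in `need` is covered."""
    return max(math.ceil(need[r] / shape[r]) for r in need)

def idle_fraction(need: dict, shape: dict) -> dict:
    """Per-resource share of the rented capacity that sits unused."""
    n = nodes_needed(need, shape)
    return {r: 1 - need[r] / (n * shape[r]) for r in need}

# Hypothetical disk-heavy workload and a hypothetical fixed instance shape.
need = {"disk_tb": 400, "ram_gb": 2000, "cores": 200}
shape = {"disk_tb": 4, "ram_gb": 64, "cores": 16}

print(nodes_needed(need, shape))   # 100, driven by disk
print(idle_fraction(need, shape))  # most of the RAM and cores are paid for but idle
```

Here disk forces 100 nodes, leaving roughly 69% of the RAM and 88% of the cores unused.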
So…
• For us, it was easy to predict the right answer
• Our SWAG for launch day was 600 servers
  – and our entire index in SSD
  – and we can’t scale down from that
• Amazon wasn’t renting SSDs yet
• If you’re going to run your own servers, you need to start early
How about you?
• RT analytics is a complicated subject
• Two main thrusts
  – Pre: pre-compute aggregate numbers, query those
  – Mem: stick a subset of your big data that fits into ram or ssd, do complicated queries against those
{ blekko only does Pre }
Pre
• Needs to be wired into your stream of data generation, e.g. your webserver
• Summary data can be pretty small
• Doesn’t really matter where you put it
• Not much impact on the cloud/no-cloud decision
{ blekko pre-computes a lot of things using “combinators” in our home-grown NoSQL, optionally stuffing them into our SSD caching system }
[Diagram: SERVER 1 and SERVER 2 each run PROCESS 1 and PROCESS 2; increments +4, +3, +4, +7 are combined into +7 and +11, then into a single +18 applied to DISK 1, DISK 2, DISK 3]
Combinators reduce the total work
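A minimal sketch of the idea in that diagram, assuming the combinator is plain addition: because adds are associative and commutative, increments can be merged per server and then across servers before anything touches disk. The key name `hits`, the assignment of increments to servers, and the write counting are illustrative assumptions, not blekko's actual implementation.

```python
from collections import Counter

def combine(deltas):
    """Merge a list of {key: increment} Counters into one by summing per key."""
    total = Counter()
    for d in deltas:
        total.update(d)  # Counter.update adds counts rather than replacing them
    return total

# Increments emitted by two processes on each of two servers (numbers from the slide).
server1 = [Counter({"hits": 4}), Counter({"hits": 3})]
server2 = [Counter({"hits": 4}), Counter({"hits": 7})]
replicas = 3  # DISK 1..3

# Without combining: every increment is written to every replica.
naive_writes = (len(server1) + len(server2)) * replicas

# With combining: merge per server, then across servers, then write once per replica.
per_server = [combine(server1), combine(server2)]
merged = combine(per_server)
combined_writes = 1 * replicas

print(per_server[0]["hits"], per_server[1]["hits"], merged["hits"])  # 7 11 18
print(naive_writes, combined_writes)  # 12 3
```

The final value on each disk is the same either way; only the number of writes changes.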
Mem
• Even a decimated subset of your fresh data can involve a lot of write bandwidth
  – Sometimes referred to as “high velocity”
• High BW probably needs to go nearby your big data store
• Analytics probably isn’t going to influence the cloud/not-cloud decision
Discuss!
• Discuss
• For more about blekko’s setup:
  – 3 part blog series at highscalability.com
  – Please search [high scalability blekko] in your search engine of choice
  – [email protected] --- @glindahl