Post on 13-Jul-2015
Scaling up (and easing) operations at 1 Million TPS @ <1 ms latency.
LSPE, Jun 14, 2014
Agenda of this talk
● Some types of Big Data?
● What are the problems that come with scale?
● What is the solution? (Or how Aerospike tackles these problems and why it is a solution for them.)
● Anshu Prateek
● Aerospike Devops Lead
● Ex - Yahoo! Search Operations
● http://about.me/anshuprateek
● anshu@aerospike.com
Big Data Type
● Volume – Hadoop – PB / Hrs of jobs
● Variety – ETL – Many data sources, mashup, analyze
● Velocity – Do it fast, do it now!
→ Volume and Variety need Velocity to be useful.
What starts failing at scale?
● Machines / hardware
● Network
● Unplanned load
● Operator error
Velocity in Aerospike
● Latency
– Page SLA 700 ms, Ads SLA 50 ms
→ Data store < 5 ms
– Hybrid DRAM + SSD optimized storage
● Throughput
– Horizontal scalability (linear is desirable)
Prod example:
● 20 Nodes
● 1.6 TB per node
● 50 GB DRAM usage
● 14 Billion objects
● 70k TPS (r+w) per node peak
● 98% of queries < 1 ms
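As a rough sanity check on the numbers above, the cluster-wide figures follow from simple multiplication. The sketch below is only arithmetic on the slide's values. It assumes that all 20 nodes hit their peak at the same time and that objects are spread evenly across nodes; the slide states neither.

```python
# Values taken from the prod example slide.
NODES = 20
PEAK_TPS_PER_NODE = 70_000        # reads + writes
STORAGE_TB_PER_NODE = 1.6
TOTAL_OBJECTS = 14_000_000_000

# Assumption: every node reaches its peak simultaneously.
cluster_peak_tps = NODES * PEAK_TPS_PER_NODE

# Assumption: objects are evenly distributed across nodes.
objects_per_node = TOTAL_OBJECTS // NODES

cluster_storage_tb = NODES * STORAGE_TB_PER_NODE

print(f"Cluster peak throughput: {cluster_peak_tps:,} TPS")
print(f"Objects per node:        {objects_per_node:,}")
print(f"Cluster storage:         {cluster_storage_tb:.1f} TB")
```

Under those assumptions the cluster peaks at about 1.4 million TPS, in line with the 1 Million TPS figure in the title.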
Yet another prod graph...
What starts failing at scale?
● Machines / hardware
● Network
● Unplanned load
● Operator error
Start scaling with Aerospike..
● Machines / hardware
– Replication / auto-balancing
● Network
– Availability of islands
– Auto balancing with eventual consistency
● Unplanned load
– Have a lot of headroom
● Operator error
– What if the system reduces operational needs?
– Tools
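To make "headroom" concrete: when a node fails and the cluster rebalances, the surviving nodes absorb its share of the traffic. The sketch below works that out for the 20-node, 70k TPS per node prod example. The function names and the 100k TPS per-node capacity are hypothetical, chosen only for illustration, and the even redistribution of load is an assumption.

```python
def tps_per_survivor(total_tps: float, nodes: int, failed: int) -> float:
    """Load each surviving node carries, assuming traffic is spread evenly."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no surviving nodes")
    return total_tps / survivors


def headroom(load_tps: float, capacity_tps: float) -> float:
    """Fraction of a node's capacity that is still unused."""
    return 1.0 - load_tps / capacity_tps


TOTAL_TPS = 20 * 70_000        # prod example: 20 nodes at 70k TPS peak
ASSUMED_CAPACITY = 100_000     # hypothetical per-node capacity, not from the talk

for failed in (0, 1, 2):
    load = tps_per_survivor(TOTAL_TPS, nodes=20, failed=failed)
    print(f"{failed} failed: {load:,.0f} TPS per node, "
          f"headroom {headroom(load, ASSUMED_CAPACITY):.0%}")
```

The point of the exercise: headroom should be sized for peak load on the cluster that remains after a failure, not on the full cluster.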
Operational Ease
● Reducing initial setup time
– Auto sharding
– Auto cluster discovery
● Configuration
– People don't read documents
● RTFM!
– Good default values
– Retain the power to control when needed
● Static configs
● Dynamic configs
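The idea behind auto sharding can be sketched in a few lines: hash each key to a fixed partition, then map partitions to whichever nodes are currently in the cluster. This is a generic illustration, not Aerospike's actual algorithm. The SHA-1 hash, the partition count, the round-robin assignment, and all function names here are assumptions made for the example.

```python
import hashlib

N_PARTITIONS = 4096  # illustrative fixed partition count (assumption)


def partition_id(key: str) -> int:
    """Hash a key to a partition; SHA-1 is a stand-in hash for this sketch."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "little") % N_PARTITIONS


def partition_map(nodes):
    """Assign partitions to nodes round-robin (simplified, hypothetical scheme)."""
    ordered = sorted(nodes)
    return {pid: ordered[pid % len(ordered)] for pid in range(N_PARTITIONS)}


def owner(key: str, nodes) -> str:
    """Node that owns a key for the given cluster membership."""
    return partition_map(nodes)[partition_id(key)]


cluster = ["node-a", "node-b", "node-c", "node-d"]
print(owner("user:42", cluster))
# When membership changes, only the map is recomputed; the key's partition stays fixed.
print(owner("user:42", cluster[:3]))
```

Because the operator never chooses a shard key or a shard layout, there is nothing to configure at setup time. A real system would also reassign partitions in a way that minimizes data movement, which this round-robin sketch does not attempt.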
Tools
● Do all nodes have the same config?
– asmonitor -e 'compareconfig'
● What's the cluster status?
– asmonitor -e 'info'
● Oops, this needs to be changed!
– asinfo -v 'set-config:context=service;letschangethis=value'
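The kind of check a config comparison performs can be sketched as follows. This is not how asmonitor is implemented; it is a minimal illustration that assumes each node's config is available as a semicolon-separated string of name=value pairs, in the same style as the set-config argument above. The node names, parameter names, and values are made up.

```python
def parse_kv(text: str) -> dict:
    """Parse 'name=value;name=value' into a dict (assumed input format)."""
    pairs = (item.split("=", 1) for item in text.strip().split(";") if "=" in item)
    return {name: value for name, value in pairs}


def config_drift(node_configs: dict) -> dict:
    """Return the parameters whose values differ between nodes."""
    parsed = {node: parse_kv(raw) for node, raw in node_configs.items()}
    all_names = set().union(*(cfg.keys() for cfg in parsed.values()))
    drift = {}
    for name in sorted(all_names):
        values = {node: cfg.get(name) for node, cfg in parsed.items()}
        if len(set(values.values())) > 1:
            drift[name] = values
    return drift


# Hypothetical sample data: node2 was changed by hand and never reverted.
configs = {
    "node1": "letschangethis=value;other-setting=10",
    "node2": "letschangethis=oldvalue;other-setting=10",
}
for name, values in config_drift(configs).items():
    print(name, values)
```

A parameter that is missing on one node shows up as None for that node, so partially applied changes are reported as drift too.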
Tools
● Nagios
● Graphite
● AMC
Capacity Planning
Managing with AMC
Headroom!
● How many TPS can we do?
● 330 GCE
● 300 x 1TB
● Debian, Cassandra 2.2
● Median Latency – 10.3 ms
● 95% < 23 ms
Aerospike