eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Feng Qu Sr MTS eBay Database Infrastructure From here to there: Our journey to 1000s of nodes Couchbase Connect 2016

Transcript of eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Page 1: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Feng Qu, Sr MTS, eBay Database Infrastructure

From here to there: Our journey to 1000s of nodes

Couchbase Connect 2016

Page 2: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Feng Qu - Sr MTS in eBay DBA Team

• Have worked on Oracle since the early 1990s
• Have worked on Cassandra, MongoDB and Couchbase since 2011
• Led company-wide NoSQL projects
• 2014 and 2015 DataStax Cassandra MVP
• Speaker at the 2013, 2014 and 2015 Cassandra Summit
• Speaker at EDW 2016
• Speaker at NoCOUG 2016

Page 3: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

eBay At A Glance

• Active Listings: 1.1B
• Active Users: 164M
• Total DB Calls: 610B/day
• Y-o-Y Growth: 30%-35%
• Total DB Servers: 4000+
• Peak DB Calls: 15M/sec
• RDBMS Calls: 500B/day
• NoSQL Calls: 110B/day
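
As a quick sanity check on these figures, the RDBMS and NoSQL call counts should add up to the total, and the daily total implies an average per-second rate that can be compared with the stated peak. A minimal sketch, assuming the daily figures are spread over a full 86,400-second day (the variable names are illustrative and not from the talk):

```python
# Figures as stated on the slide; variable names are illustrative.
rdbms_calls_per_day = 500e9
nosql_calls_per_day = 110e9
total_calls_per_day = 610e9
peak_calls_per_sec = 15e6

SECONDS_PER_DAY = 24 * 60 * 60

# The two categories should account for the whole total.
assert rdbms_calls_per_day + nosql_calls_per_day == total_calls_per_day

avg_calls_per_sec = total_calls_per_day / SECONDS_PER_DAY
peak_to_avg = peak_calls_per_sec / avg_calls_per_sec
nosql_share = nosql_calls_per_day / total_calls_per_day

print(f"average: {avg_calls_per_sec / 1e6:.2f}M calls/sec")
print(f"peak-to-average ratio: {peak_to_avg:.2f}")
print(f"NoSQL share of calls: {nosql_share:.1%}")
```

The average works out to roughly 7 million calls per second, so the stated peak is about twice the daily average, and NoSQL accounts for a little under a fifth of all calls.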

Page 4: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Challenges of Traditional RDBMS

• Challenges
  • Performance penalty to maintain ACID features
  • Lack of native sharding and replication features
  • Cost of software/hardware
  • Higher cost of commit

Page 5: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Different Databases Serve Different Purposes

Page 6: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

NoSQL Databases Pros and Cons

Pros
• Geo-distributed replication & sharding
• Location-aware low-latency query performance
• Workload & access pattern optimized
• Linear scalability with reduced disruption to business
• Supports semi-structured or unstructured data
• Flexible schema provides significant increase in Dev agility

Cons
• Lacks strict ACID-compliant transactions
• Lacks strong data model control & governance
• Not suitable for ad-hoc workloads & random access patterns
• Requires change of mindset, ecosystem and infrastructure
• Rapidly changing technology & competitive landscape
• Requires Dev expertise in the nuances of distributed systems

Page 7: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

MongoDB Pros and Cons

Pros
• Dev-friendly rich JSON document model
• Secondary index enables mixed access patterns
• High business value (semi-)structured data
• Balanced scale-out reads & writes (with optional sharding)
• Straightforward admin effort

Cons
• Short write interruption during primary re-election
• Not suitable for nanosecond latency writing
• Potentially high TCO for large-scale sharded clusters
• Lacks resource isolation

Page 8: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Cassandra Pros and Cons

Pros
• Peer-to-peer without SPOF (Single Point of Failure)
• Active-active across datacenters
• High read & very high write performance
• Absolute linear scalability

Cons
• Inefficient secondary index (pre-V3)
• Not suitable for mixed user query & access patterns
• High compaction overhead for frequent random deletes
• Requires JVM tuning to mitigate GC pauses
• Lacks resource isolation
• Slow cluster rebalancing

Page 9: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Couchbase Pros and Cons

Pros
• Memcached-compatible persistent document store
• Peer-to-peer architecture
• High read & write performance
• Active-active cluster replication
• Strong local cluster RW consistency
• Resource isolation

Cons
• Short write interruption during node failover
• Counter-intuitive cross-DC write conflict resolution (pre-V4.6)
• Slow cluster rebalancing
• Slow warm-up
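
The slide does not say why pre-4.6 cross-DC conflict resolution felt counter-intuitive. One likely reason, assuming the default revision-based XDCR policy of those releases, is that the copy with more updates wins even when the other copy was written later. The sketch below is a simplified model of that difference, not Couchbase code: the `DocVersion` class, its fields, and both resolver functions are illustrative, and the real policy applies further tie-breakers that are omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocVersion:
    """Simplified per-document metadata; field names are illustrative."""
    value: str
    rev_seqno: int      # number of mutations applied to this copy
    timestamp_ms: int   # time of the last mutation on this copy


def resolve_by_revision(a: DocVersion, b: DocVersion) -> DocVersion:
    """'Most updates wins': keep the copy that was mutated more often."""
    return a if (a.rev_seqno, a.timestamp_ms) >= (b.rev_seqno, b.timestamp_ms) else b


def resolve_by_timestamp(a: DocVersion, b: DocVersion) -> DocVersion:
    """'Last write wins': keep the copy that was written most recently."""
    return a if (a.timestamp_ms, a.rev_seqno) >= (b.timestamp_ms, b.rev_seqno) else b


# DC1 updated the document several times early on; DC2 updated it once, later.
dc1 = DocVersion("cart=[A,B,C]", rev_seqno=4, timestamp_ms=1_000)
dc2 = DocVersion("cart=[]", rev_seqno=2, timestamp_ms=5_000)

print("revision-based winner: ", resolve_by_revision(dc1, dc2).value)
print("timestamp-based winner:", resolve_by_timestamp(dc1, dc2).value)
```

In this model the revision-based policy keeps the older DC1 copy because it has been mutated more often, while the timestamp-based policy keeps the later DC2 write.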

Page 10: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

NoSQL Footprints at eBay

Besides Oracle/MySQL, we also have:
• Cassandra
• Couchbase
• MongoDB
• HBase
• Memcached
• Neo4j
• OpenTSDB
• Redis
• …

Page 11: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Why Couchbase?

• Memcached-compatible persistent caching
• Elastic scalability
• High RW performance & throughput
• Active-active bi-directional XDCR
• High local cluster RW consistency
• Flexible document model
• Development agility
• SQL integration
• And more…

Page 12: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Environment

• Support both dedicated & multi-tenant clusters
• Couchbase Enterprise 3.1/4.5 running on BM & VM
  • High I/O flavor
  • High memory flavor
  • High storage flavor
• Customized RPM
  • Customized to suit the eBay environment
  • Easy to install/upgrade, easy to maintain, ensures deployment consistency across the board and makes deployment differences easy to identify
  • Built-in pre-defined tuning parameters when needed
• Homegrown client wrapper for central application logging and reporting
• QA/LnP/PreProd/Prod environments

Page 13: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Couchbase Onboarding Process

Understand product limitations
- Avoid known anti-patterns and look beyond the generic use case

NoSQL product evaluation & selection
- Business & Technology perspective
- Product selection flowchart & detailed scoring card

Data modeling, POC with LnP, failover & DR testing
- Review test results, re-evaluate initial assumptions

Capacity planning and provisioning

Page 14: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

eBay Couchbase At A Glance

• Total Clusters: 120
• Total Servers: 1400
• Couchbase Calls: 80B/day
• Y-o-Y Growth: >100%
• Total Data Size: 90TB
• Total Documents: 60B
• Peak Sets/Cluster: 800,000/sec
• Peak Gets/Cluster: 1,200,000/sec
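
A few derived averages help put these totals in perspective. The sketch below is back-of-the-envelope arithmetic only; it assumes decimal terabytes and an even spread across servers and across the day, neither of which the slide states, and the variable names are illustrative.

```python
# Figures as stated on the slide; variable names are illustrative.
total_clusters = 120
total_servers = 1400
calls_per_day = 80e9
total_data_bytes = 90e12   # 90 TB, assuming decimal terabytes
total_documents = 60e9

SECONDS_PER_DAY = 24 * 60 * 60

servers_per_cluster = total_servers / total_clusters
avg_calls_per_sec = calls_per_day / SECONDS_PER_DAY
avg_doc_bytes = total_data_bytes / total_documents
data_per_server_gb = total_data_bytes / total_servers / 1e9

print(f"servers per cluster (avg): {servers_per_cluster:.1f}")
print(f"calls per second (avg):    {avg_calls_per_sec:,.0f}")
print(f"bytes per document (avg):  {avg_doc_bytes:,.0f}")
print(f"data per server (avg):     {data_per_server_gb:.1f} GB")
```

Under these assumptions the average cluster has about a dozen servers, the average document is about 1.5 KB, and the fleet-wide average call rate is a little under one million calls per second.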

Page 15: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Typical Use Cases

• Write Intensive
  • User session tracking
  • 13 billion writes per day
• Read Intensive
  • Email notification
  • 4 billion reads per day
• Mixed workload
  • Central monitoring platform where metrics are collected for hundreds of thousands of devices in real time
  • 2 billion writes per day
  • 10 billion reads per day

Page 16: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Global User Preference

• Global repository with a streamlined service for managing world-wide user preferences, which come from the Data Warehouse
  • Seller advertising
  • Member communication
  • User account setting
  • Notification preferences, etc.

Page 17: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Central Monitoring and Alerting

Entire eBay site monitoring system is built on Couchbase!

• We have 2 sets of clusters (active/passive), A (3 DC) and B (3 DC), for upgrade/patch

Page 18: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Elastic Scalability

• Benchmarking
  • Performance baseline for new hardware, new software release
• Enforce full-scale testing in dedicated LnP env before going to production
• In general, scale out by adding more nodes to increase throughput or reduce latency
• Sometimes, it's cost-efficient to scale up at component level by identifying the scaling bottleneck, then resolving it accordingly
• Scale up (vertical)
  • Smaller data center footprint, such as space, power, cooling
  • Lower license cost
• Scale out (horizontal)
  • Cheaper using commodity hardware
  • More fault tolerant
  • (Unlimited) upgradability
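
To make the scale-out rule of thumb concrete, the node count for a target throughput can be estimated from a benchmarked per-node baseline. This is a minimal sketch under stated assumptions: the per-node throughput and the headroom percentage are hypothetical placeholders (the talk gives neither), and the function name `nodes_needed` is illustrative. Only the target of 1,200,000 gets/sec comes from the talk's per-cluster peak figure.

```python
import math


def nodes_needed(target_ops_per_sec: int,
                 per_node_ops_per_sec: int,
                 headroom_pct: int) -> int:
    """Nodes required to serve the target while keeping headroom_pct spare per node."""
    usable_per_node = per_node_ops_per_sec * (100 - headroom_pct) / 100
    return math.ceil(target_ops_per_sec / usable_per_node)


# Target from the talk's peak gets per cluster; the other two inputs are
# hypothetical and would come from LnP benchmarking in practice.
target = 1_200_000
per_node_baseline = 100_000   # hypothetical benchmark result
headroom = 30                 # hypothetical spare capacity, in percent

print(nodes_needed(target, per_node_baseline, headroom))
```

With these placeholder inputs each node contributes 70,000 usable ops/sec, so the estimate is 18 nodes. Real sizing would also need to account for memory, disk, and replica placement, which this sketch ignores.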

Page 19: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Couchbase Learning Experience

• Lack of always available writes
  • Application option to write to remote DC when local write fails
• Cross DC update conflict resolution
  • Unpredictable behavior but new features in v4.6 solve this
• Metadata memory overhead
  • 56 bytes metadata is too much if you have a small key
• Memory fragmentation
  • CB 4.x replaces TCMalloc with jemalloc libraries
• Slow rebalance
  • Using swap rebalance when applicable
• Slow warm-up
  • Remove access log to speed up warm-up
• 10 bucket limits not working well for shared QA env
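
The first workaround, writing to a remote DC when the local write fails, can be sketched at the application level as follows. This is an illustration only: `InMemoryCluster`, `WriteFailed`, and `write_with_fallback` are hypothetical stand-ins and not part of any Couchbase SDK or of eBay's client wrapper, which the talk does not describe in detail.

```python
class WriteFailed(Exception):
    """Raised by a cluster client when a write cannot be completed."""


class InMemoryCluster:
    """Hypothetical stand-in for a per-datacenter cluster client."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.data = {}

    def upsert(self, key, value):
        if not self.available:
            raise WriteFailed(f"{self.name} is unavailable")
        self.data[key] = value


def write_with_fallback(key, value, local, remote):
    """Try the local DC first; if that write fails, write to the remote DC.

    Returns the name of the cluster that accepted the write.
    """
    try:
        local.upsert(key, value)
        return local.name
    except WriteFailed:
        # Local write failed (for example during a node failover),
        # so send the write to the remote datacenter instead.
        remote.upsert(key, value)
        return remote.name


# Simulate a local outage: the write should land in the remote DC.
local = InMemoryCluster("dc-local", available=False)
remote = InMemoryCluster("dc-remote")
accepted_by = write_with_fallback("session::42", {"user": "alice"}, local, remote)
print(accepted_by)
```

A fallback like this presumably relies on cross-DC replication to bring the write back to the local cluster once it recovers, which is where the conflict resolution behavior listed above becomes relevant.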

Page 20: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Couchbase Wish List

• We would like to see
  • Point-in-time recovery so we can store SOR data
  • Global admin console to manage multiple clusters
  • Smaller metadata to reduce memory requirement
  • Robust rebalance
  • Lazy warm-up so a failed node can rejoin quicker
  • Simplified XDCR/Compaction tuning
  • One log, just one log
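
The request for smaller metadata is easier to appreciate with numbers. The sketch below uses the 56 bytes per document and the 60 billion documents quoted earlier in the talk. It assumes one copy of each document, ignores key length by default, and assumes all metadata stays resident in memory; the value sizes in the loop and the function name `metadata_share` are illustrative.

```python
METADATA_BYTES_PER_DOC = 56   # per-document metadata size quoted in the talk


def metadata_share(value_bytes: int, key_bytes: int = 0) -> float:
    """Fraction of an item's footprint that is metadata (plus key), not value."""
    overhead = METADATA_BYTES_PER_DOC + key_bytes
    return overhead / (overhead + value_bytes)


total_documents = 60e9        # total documents quoted in the talk
total_metadata_tb = total_documents * METADATA_BYTES_PER_DOC / 1e12
print(f"metadata for all documents: {total_metadata_tb:.2f} TB")

# Illustrative value sizes: a very small item versus a 1.5 KB document.
for value_bytes in (64, 1500):
    share = metadata_share(value_bytes)
    print(f"{value_bytes:>5}-byte value: {share:.1%} of footprint is metadata")
```

Under these assumptions the metadata alone comes to about 3.4 TB across the fleet, and for very small items it approaches half of the per-item footprint, whereas for a 1.5 KB document it is under 4%.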

Page 21: eBay: From here to there: our journey to 1000s of nodes – Couchbase Connect 2016

Questions?

eBay is hiring experienced NoSQL professionals, please send resume to [email protected]