CouchConf SF 2012 Lightning Talk - Operational Excellence

Laine Campbell, Owner/Principal, [email protected]
Charlie Killian, Director of Engineering, [email protected]

Scaling and Performance for Operational Excellence


Who we are

● A boutique consultancy offering custom solutions.

● An operations support team providing a combined 100+ years of experience in distributed, performant and scalable solutions.

● A team of architects, engineers and operators who have worked at some of the most trafficked sites, games and companies since 1999.


Operational Excellence

● Configuration management and documentation.
● Change management.
● Availability management.
● Incident and problem management.
● Backup, recovery and business continuity.
● Monitoring and trending.


Configuration Management

● Consistent Couchbase configurations.
  ○ GUIs are great, but don't meet automation needs.

● Self-documenting environments.

● Incorporating your infrastructure into your application to leverage Couchbase's ease of scaling.

● Chef, Puppet, Ansible or "roll your own" using the Couchbase API.
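
A minimal sketch of the "roll your own" option, assuming the cluster's bucket settings have already been fetched over the Couchbase REST API (for example from `/pools/default/buckets` on the admin port) and parsed into a list of dicts. The helper name `bucket_drift`, the desired-state layout and the setting keys compared here are illustrative assumptions, not something shown in the talk.

```python
# Sketch: detect drift between desired bucket settings (kept in version
# control) and what a cluster reports. The setting names below are
# assumptions; compare whichever fields your Couchbase version returns.

def bucket_drift(desired, actual_buckets):
    """Return {bucket: {setting: (wanted, found)}} for every mismatch."""
    actual = {b["name"]: b for b in actual_buckets}
    drift = {}
    for name, wanted in desired.items():
        found = actual.get(name)
        if found is None:
            # The bucket is expected but the cluster does not report it.
            drift[name] = {"<bucket>": ("present", "missing")}
            continue
        diffs = {
            key: (value, found.get(key))
            for key, value in wanted.items()
            if found.get(key) != value
        }
        if diffs:
            drift[name] = diffs
    return drift


# Desired state, as it might be stored alongside application code.
DESIRED = {
    "sessions": {"replicaNumber": 1, "bucketType": "membase"},
    "profiles": {"replicaNumber": 2, "bucketType": "membase"},
}

# Stand-in for the parsed JSON a cluster would return.
REPORTED = [
    {"name": "sessions", "replicaNumber": 1, "bucketType": "membase"},
    {"name": "profiles", "replicaNumber": 1, "bucketType": "membase"},
]

if __name__ == "__main__":
    for bucket, diffs in bucket_drift(DESIRED, REPORTED).items():
        for setting, (wanted, found) in diffs.items():
            print(f"{bucket}: {setting} should be {wanted!r}, is {found!r}")
```

Keeping the desired state in a file next to the code is also one way to get the self-documenting environment mentioned above: the file is both the documentation and the input to the check.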


Change and Release Management

● Schemaless is great, but data governance is key.

● Your code needs to build a data dictionary or confusion reigns.
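
One way code can build a data dictionary is to walk a sample of documents and record which fields appear and with which types. This is a sketch under that assumption only; the function name, the dotted-path convention and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_data_dictionary(documents):
    """Map each dotted field path to the sorted Python type names seen."""
    dictionary = defaultdict(set)

    def walk(value, path):
        if isinstance(value, dict):
            # Descend into nested objects, extending the dotted path.
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            # Leaves (including lists) are recorded by their type name.
            dictionary[path].add(type(value).__name__)

    for doc in documents:
        walk(doc, "")
    return {path: sorted(types) for path, types in sorted(dictionary.items())}


# Hypothetical documents; note that "age" is stored two different ways.
SAMPLE_DOCS = [
    {"type": "user", "name": "ada", "age": 36},
    {"type": "user", "name": "bob", "age": "41", "address": {"zip": "94107"}},
]

if __name__ == "__main__":
    for path, types in build_data_dictionary(SAMPLE_DOCS).items():
        flag = "  <-- mixed types" if len(types) > 1 else ""
        print(f"{path}: {', '.join(types)}{flag}")
```

Fields that show more than one type are the places where schemaless data has already started to diverge, which makes them a natural starting point for a governance conversation.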

● DevOps-style relationships build collaboration that can overcome the Wild West mentality of schemaless environments.


Availability Management

● Moxi provides availability during node failures, supporting reads and writes.

● XDCR support in Couchbase 2.0 provides availability across datacenters and regions in an active/active topology.

● Cloud environments need special consideration: plan for availability zone (AZ) and region failovers.


Incident and Problem Management

● While not Couchbase-specific, this is crucial to maintaining any highly available architecture.

● Appropriate alerting, response and communication processes ensure that isolated issues don't cascade into massive failures.

● Failing hardware, network problems and design issues can all cause failures that cascade until an entire cluster is down.

● Tracking recurring problems helps drive continuous improvement in meeting SLAs.


Backup and Recovery

● Define your recovery SLAs.
● Track how long backups take.
● Test restores and track how long they take.
● Recognize all failure scenarios:
  ○ Node failure
  ○ Physical data corruption
  ○ Logical data corruption
  ○ Audits and forensics
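
Tracking how long backups and restores take can be as simple as wrapping each run in a timer and comparing the slowest run against the recovery SLA. The sketch below assumes that approach; the names `timed` and `sla_breaches`, the in-memory `durations` store and the four-hour figure are all hypothetical, and the `sleep` stands in for whatever backup or restore command is actually used.

```python
import time
from contextlib import contextmanager

durations = {}  # operation name -> list of elapsed seconds


@contextmanager
def timed(operation, clock=time.monotonic):
    """Record how long the wrapped block takes under `operation`."""
    start = clock()
    try:
        yield
    finally:
        # Record the duration even if the wrapped step raised.
        durations.setdefault(operation, []).append(clock() - start)


def sla_breaches(recorded, sla_seconds):
    """Return {operation: slowest run} for operations over their SLA."""
    return {
        op: max(runs)
        for op, runs in recorded.items()
        if op in sla_seconds and max(runs) > sla_seconds[op]
    }


if __name__ == "__main__":
    with timed("restore"):
        time.sleep(0.01)  # placeholder for the real restore command
    # Assumed SLA: a restore must finish within four hours.
    print(sla_breaches(durations, {"restore": 4 * 3600}))
```

In practice the durations would be written somewhere durable so that growth in backup and restore times shows up as a trend rather than as a surprise during an incident.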


Backup and Recovery 1.8

● In 1.8, per-node backup is supported. Replica sets are also backed up, which can cause long or non-completing backups.

● SQLite3 can be used to take a logical dump, which eases backups.

● Cluster-wide consistency cannot be guaranteed.

● No incremental backups are available.


Backup and Recovery 2.0

● Cluster-wide backups are now available, as are incremental backups.

● EBS snapshots (or LVM, hardware, etc.) work well due to log-style writes to disk.

● With incremental backups, it is easier to meet SLAs without breaking the bank on storage.
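
A rough back-of-the-envelope comparison shows why incrementals help with storage cost. The model is deliberately simple and every number in it is an assumption for illustration (dataset size, retention period, daily change ratio, weekly fulls); it ignores compression and the extra restore time that replaying incrementals adds.

```python
def full_only_gb(dataset_gb, retention_days):
    """Storage needed when every daily backup is a full copy."""
    return dataset_gb * retention_days


def full_plus_incremental_gb(dataset_gb, retention_days, daily_change_ratio,
                             full_every_days=7):
    """Storage for periodic fulls with daily incrementals in between."""
    total = 0.0
    for day in range(retention_days):
        if day % full_every_days == 0:
            total += dataset_gb  # full backup day
        else:
            total += dataset_gb * daily_change_ratio  # incremental day
    return total


if __name__ == "__main__":
    # Assumed: 500 GB dataset, 28 days retained, 5% of data changes per day.
    print("full only:", full_only_gb(500, 28), "GB")
    print("full + incremental:", full_plus_incremental_gb(500, 28, 0.05), "GB")
```

Under these assumed inputs the weekly-full schedule needs roughly 2,600 GB against 14,000 GB for daily fulls. The gap narrows as the daily change ratio grows, so the change ratio is worth measuring rather than guessing.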


Monitoring and Alerting

● Use logs! Centralized syslog, Splunk or custom scripts can identify and track error types and rates.

● Track your app! Latency of web pages, forms and API calls is a key indicator.

● Define key alerts, and make them actionable and tied to documentation.

● Palomino builds plugins and templates to provide proper alerts that are useful and work!
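
As an illustration of the "custom scripts" option for tracking error types and rates, the sketch below counts ERROR lines per minute and per error type, then flags the combinations that exceed a threshold. The log line shape, the error names and the threshold are hypothetical; a real script would be adapted to whatever the centralized logs actually contain.

```python
import re
from collections import Counter

# Assumed line shape: "<ISO timestamp> <host> <LEVEL> <error_type>: <message>"
LINE = re.compile(
    r"^(?P<minute>\S+T\d{2}:\d{2}):\d{2}\S*\s+\S+\s+ERROR\s+(?P<kind>\w+):"
)


def error_counts(lines):
    """Count ERROR lines per (minute, error type)."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match:
            counts[(match["minute"], match["kind"])] += 1
    return counts


def over_threshold(counts, per_minute_limit):
    """Return the (minute, error type) pairs seen more than the limit."""
    return sorted(key for key, n in counts.items() if n > per_minute_limit)


# Hypothetical sample of centralized log lines.
SAMPLE_LOG = [
    "2012-01-01T10:15:02Z app01 ERROR Timeout: get user:42",
    "2012-01-01T10:15:40Z app02 ERROR Timeout: get user:7",
    "2012-01-01T10:15:41Z app01 INFO Connected: node3",
    "2012-01-01T10:16:05Z app01 ERROR TempFail: set cart:9",
]

if __name__ == "__main__":
    counts = error_counts(SAMPLE_LOG)
    for minute, kind in over_threshold(counts, per_minute_limit=1):
        print(f"ALERT {minute} {kind}: {counts[(minute, kind)]} errors")
```

Alerting on a rate per error type, rather than on any single error line, is one way to keep alerts actionable: each alert names what is failing and how often.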


Trending and Diagnostics

● Alerts aren't enough. You must track usage and internal metrics to understand trends, workloads and bottlenecks.

● Graph everything! Graph all exposed metrics and trend your health checks.

● Overlay graphs of internal metrics with external factors: code pushes and application metrics (logins, purchases, API calls).
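
The same overlay idea can be applied numerically: line up a metric series with code push times and compare the metric just before and just after each push. This is a sketch under assumed inputs; the function name, the sample latency values and the window size are hypothetical.

```python
from statistics import mean


def deploy_impact(samples, deploys, window):
    """For each deploy time, compare the mean metric before and after.

    samples: list of (timestamp, value) pairs
    deploys: list of deploy timestamps
    window:  how far to look on each side, in the same time units
    """
    report = []
    for deployed_at in deploys:
        before = [v for t, v in samples
                  if deployed_at - window <= t < deployed_at]
        after = [v for t, v in samples
                 if deployed_at <= t < deployed_at + window]
        if before and after:  # skip deploys with no data on one side
            report.append((deployed_at, mean(before), mean(after)))
    return report


# Hypothetical latency samples (minute, milliseconds) and one code push.
LATENCY = [(0, 10), (1, 12), (2, 11), (3, 20), (4, 22), (5, 21)]
PUSHES = [3]

if __name__ == "__main__":
    for when, before, after in deploy_impact(LATENCY, PUSHES, window=3):
        print(f"push at t={when}: mean {before} ms before, {after} ms after")
```

A graph makes the same comparison visually; the numeric version is useful when there are too many pushes or metrics to inspect by eye.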


Care and Feeding

● Regular performance reviews.
● Defragmentation.
● Incorporate recovery tests into building test and dev environments.
● Scale-up/scale-down, preferably via automated processes.
● Rolling upgrades.
● Coffee, pie, beer.


Partnering with Couchbase

Providing remote Architecture, Engineering and DBA services to clients.

Vendor-neutral operations and scaling expertise for Couchbase clients in need of operators.


Remote Architecture and Engineering Services

● Architecture review and recommendations
● Data modeling
● Data model migration
● Data migration
● Cluster sizing
● Tools development


DBA and Operations Services

● Infrastructure builds and management
● Proactive operational support
● 24x7 operational support with a 30-minute SLA
● System health checks
● Backup and recovery
● Tuning for performance and scale
● Query reviews, indexing, benchmarking
● Capacity reviews


How we can help

● Support your proof of concept
● Migrate you to Couchbase Server
● Support your Couchbase Server clusters


Is Couchbase Server a good fit?

● Architecture review
● Data model review
● Recommendation on moving to Couchbase Server
● Data access best practices


Migrating from an RDBMS to Couchbase Server?

● Data model migration from relational to document
● Data migration from SQL Server to Couchbase Server
● Couchbase Server cluster sizing
● Infrastructure builds


Do you need operational experts?

● 24x7 operational support with a 30-minute SLA
● Multiple Couchbase Server 1.8 clusters
● Wanted Couchbase operational experts
● Escalate to Couchbase for software support


Contact Info

Laine Campbell, [email protected]
Charlie Killian, [email protected]

www.palominodb.com
@palominodb on Twitter