High performance Infrastructure Oct 2013
![Page 1: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/1.jpg)
High Performance Infrastructure
![Page 2: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/2.jpg)
David Mytton
Woop Japan!
![Page 3: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/3.jpg)
![Page 4: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/4.jpg)
Server Density Infrastructure
![Page 5: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/5.jpg)
•150 servers
Server Density Infrastructure
![Page 6: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/6.jpg)
• June 2009 - 4yrs
Server Density Infrastructure
•150 servers
![Page 7: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/7.jpg)
•MySQL -> MongoDB
• June 2009 - 4yrs
Server Density Infrastructure
•150 servers
![Page 8: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/8.jpg)
•MySQL -> MongoDB
•25TB data per month
• June 2009 - 4yrs
Server Density Infrastructure
•150 servers
![Page 9: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/9.jpg)
Picture is unrelated! Mmm, ice cream.
• Fast network
Performance
![Page 10: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/10.jpg)
• Fast network
Performance
EC2 10 Gigabit Ethernet
- Cluster Compute
- High Memory Cluster
- Cluster GPU
- High I/O
- High Storage

- Network cards
- VLAN separation
![Page 11: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/11.jpg)
• Fast network
Performance
Workload: Read/Write?
Result set size
What is being stored?
- Read / write: adds to replication oplog
- Images? Web pages? Tiny documents?
- What is being returned? Optimised to return certain fields?
![Page 12: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/12.jpg)
• Fast network
Performance

| Use | Network Throughput |
| --- | --- |
| Normal | 0-100Mb/s |
| Replication (Initial Sync) | Burst +100Mb/s |
| Replication (Oplog) | 0-100Mb/s |
| Backup | Initial Sync + Oplog |

![Page 13: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/13.jpg)
• Fast network
Performance
Inter-DC LAN
- Latency
![Page 14: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/14.jpg)
• Fast network
Performance
Inter-DC LAN
Cross USA Washington, DC - San Jose, CA
![Page 15: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/15.jpg)
• Fast network
Performance

| Location | Ping RTT Latency |
| --- | --- |
| Within USA | 40-80ms |
| Trans-Atlantic | 100ms |
| Trans-Pacific | 150ms |
| Europe - Japan | 300ms |

- Ping: low overhead
- Important for replication
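As a rough illustration of why round-trip time matters for replication: a write that waits for acknowledgement from a secondary in another region cannot complete much faster than one round trip, so a single connection issuing such writes one after another is capped at roughly 1000 / RTT (ms) writes per second. The sketch below applies that back-of-envelope bound to the RTT figures above, taking the upper end of the within-USA range. It is an assumption-laden ceiling that ignores disk, server and oplog processing time, so real numbers will be lower.

```python
# Upper bound on sequential acknowledged writes per second for one connection,
# assuming each write waits one full round trip to a remote secondary.
RTT_MS = {
    "Within USA": 80,        # upper end of the 40-80ms range
    "Trans-Atlantic": 100,
    "Trans-Pacific": 150,
    "Europe - Japan": 300,
}


def max_sequential_writes_per_sec(rtt_ms: float) -> float:
    """Back-of-envelope ceiling: at most one acknowledged write per round trip."""
    return 1000.0 / rtt_ms


for location, rtt in RTT_MS.items():
    print(f"{location}: at most {max_sequential_writes_per_sec(rtt):.1f} writes/s")
```

Parallel connections raise total throughput, but each individual write still pays the round trip.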
![Page 16: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/16.jpg)
Failover
•Replication
![Page 17: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/17.jpg)
Failover
•Master/slave
•Replication
- One master accepts all writes
- Many slaves staying up to date with master
- Can read from slaves
![Page 18: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/18.jpg)
Failover
•Min 3 nodes
•Master/slave
•Replication
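- Minimum of 3 nodes to form a majority in case one goes down. All store data.
- Use an odd number of nodes, otherwise a majority may not be possible.
- Arbiter: a voting member that stores no data, useful for making the count odd.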
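The majority arithmetic behind the "minimum 3, odd number" advice can be sketched in a few lines. This is plain counting and does not talk to MongoDB; the function names are made up for illustration.

```python
def majority(voting_members: int) -> int:
    """Smallest number of votes that is more than half of the set."""
    return voting_members // 2 + 1


def tolerated_failures(voting_members: int) -> int:
    """How many members can be lost while a majority can still be formed."""
    return voting_members - majority(voting_members)


for n in range(1, 8):
    print(f"{n} members: majority={majority(n)}, can lose {tolerated_failures(n)}")
```

Two members cannot lose any node, three can lose one, and going from three to four adds cost without adding any extra tolerance, which is why an even count buys nothing.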
![Page 19: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/19.jpg)
Failover
•Min 3 nodes
•Master/slave
•Automatic failover
•Replication
- Drivers handle automatic failover.
- The first query after a failure will fail, which triggers a reconnect.
- The application needs to handle retries.
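A minimal sketch of the retry handling mentioned above, written as a generic helper so it runs without a database. `with_retries`, `FakeAutoReconnect` and `flaky_query` are hypothetical names; with PyMongo the exception to pass as `retry_on` would typically be `pymongo.errors.AutoReconnect`. Retrying blindly is only safe for operations that can be repeated without side effects.

```python
import time


def with_retries(operation, retry_on, attempts=3, delay=0.5, sleep=time.sleep):
    """Call operation(); on a retry_on exception, wait and try again.

    Re-raises the last exception once `attempts` calls have failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay * attempt)  # simple linear backoff


class FakeAutoReconnect(Exception):
    """Stand-in for the driver's reconnect error (hypothetical)."""


calls = {"count": 0}


def flaky_query():
    """Fails on the first call, as the first query after a failover would."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise FakeAutoReconnect("primary stepped down")
    return {"ok": 1}


# sleep is stubbed out so the example finishes immediately
result = with_retries(flaky_query, retry_on=FakeAutoReconnect, sleep=lambda s: None)
print(result, "after", calls["count"], "calls")
```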
![Page 20: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/20.jpg)
•Replication lag
Performance

| Location | Ping RTT Latency |
| --- | --- |
| Within USA | 40-80ms |
| Trans-Atlantic | 100ms |
| Trans-Pacific | 150ms |
| Europe - Japan | 300ms |

- Replication lag
![Page 21: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/21.jpg)
Replication Lag
1. Reads: eventual consistency
![Page 22: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/22.jpg)
Replication Lag
1. Reads: eventual consistency
2. Failover: slave behind
![Page 23: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/23.jpg)
Eventual Consistency
Stale data
Not what the user submitted?
![Page 24: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/24.jpg)
Eventual Consistency
Stale data
Inconsistent data
Doesn’t reflect the truth
![Page 25: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/25.jpg)
Eventual Consistency
Stale data
Inconsistent data
Changing data
Could change on every page refresh
![Page 26: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/26.jpg)
Eventual Consistency

| Use Case | Needs consistency? |
| --- | --- |
| Graphs | No |
| User profile | Yes |
| Statistics | Depends |
| Alert config | Yes |

Statistics - depends on when they’re updated
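One way to act on this table is to choose where each kind of read goes. The sketch below is only an illustration: the use-case keys are hypothetical labels, and the returned strings are MongoDB read preference mode names that an application could hand to its driver.

```python
# Whether each use case needs consistent reads, following the table above.
NEEDS_CONSISTENCY = {
    "graphs": False,
    "user_profile": True,
    "alert_config": True,
    # "statistics" is left out on purpose: it depends on when they are updated,
    # so it falls through to the safe default below.
}


def read_preference_for(use_case: str) -> str:
    """Return a read preference mode; unknown cases default to the primary."""
    if NEEDS_CONSISTENCY.get(use_case, True):
        return "primary"
    return "secondaryPreferred"


for case in ("graphs", "user_profile", "statistics", "alert_config"):
    print(case, "->", read_preference_for(case))
```

Defaulting to the primary means a forgotten use case gets correct data rather than stale data.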
![Page 27: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/27.jpg)
Replication Lag
1. Reads: eventual consistency
2. Failover: slave behind
![Page 28: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/28.jpg)
Slave behind
Failover: out of date master
- Old data
- Rollback
![Page 29: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/29.jpg)
• Safe by default
>>> from pymongo import MongoClient
>>> connection = MongoClient(w=1)  # w takes an int or a str, e.g. 1, 2 or "majority"

| Value | Meaning |
| --- | --- |
| 0 | Unsafe |
| 1 | Primary |
| 2 | Primary + x1 secondary |
| 3 | Primary + x2 secondaries |

MongoDB WriteConcern
wtimeout: how long (in milliseconds) to wait for the write to be acknowledged before raising an exception
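The same write concern settings can also be carried in the connection string. The sketch below builds such a URI with the standard `replicaSet`, `w` and `wtimeoutMS` options; the host names, the replica set name `rs0` and the helper `replica_set_uri` are assumptions for illustration, not values from the talk.

```python
from urllib.parse import urlencode


def replica_set_uri(hosts, replica_set, w, wtimeout_ms):
    """Build a MongoDB connection string carrying write concern options."""
    options = urlencode(
        {"replicaSet": replica_set, "w": w, "wtimeoutMS": wtimeout_ms}
    )
    return f"mongodb://{','.join(hosts)}/?{options}"


uri = replica_set_uri(
    hosts=["db1.example.com:27017", "db2.example.com:27017", "db3.example.com:27017"],
    replica_set="rs0",
    w=2,               # primary + one secondary must acknowledge the write
    wtimeout_ms=5000,  # stop waiting for acknowledgement after 5 seconds
)
print(uri)
```

The resulting string can be passed to `MongoClient(uri)`. A higher `w` trades write latency for a smaller chance of losing data on failover.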
![Page 30: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/30.jpg)
![Page 31: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/31.jpg)
Picture is unrelated! Mmm, ice cream.
• Fast network
•More RAM
Performance
![Page 32: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/32.jpg)
http://www.slideshare.net/jrosoff/mongodb-on-ec2-and-ebs
- No 32 bit
- No High CPU
- RAM RAM RAM.
![Page 33: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/33.jpg)
http://blog.pythonisito.com/2011/12/mongodbs-write-lock.html
![Page 34: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/34.jpg)
http://blog.pythonisito.com/2011/12/mongodbs-write-lock.html
![Page 35: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/35.jpg)
More RAM = expensive Performance
x2 4GB RAM 12 month Prices
![Page 36: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/36.jpg)
RAM
SSDs
Spinning disk
Cost Speed
![Page 37: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/37.jpg)
Softlayer disk pricing Performance
![Page 38: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/38.jpg)
EC2 disk/RAM pricing Performance
$2232/m
$2520/m
$43/m
$295/m
![Page 39: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/39.jpg)
SSD vs Spinning Performance
SSDs are better at buffered disk reads, sequential input and random i/o.
![Page 40: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/40.jpg)
SSD vs Spinning Performance
However, CPU usage for SSDs is higher. This may be a driver issue, so it is worth testing on your own hardware. Tests were done using Bonnie.
![Page 41: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/41.jpg)
Cloud?
![Page 42: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/42.jpg)
Cloud?
•Elastic workloads
![Page 43: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/43.jpg)
Cloud?
•Elastic workloads
•Demand spikes
![Page 44: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/44.jpg)
Cloud?
•Elastic workloads
•Demand spikes
•Unknown requirements
![Page 45: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/45.jpg)
Dedicated?
![Page 46: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/46.jpg)
Dedicated?
•Hardware replacement
![Page 47: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/47.jpg)
Dedicated?
•Hardware replacement
•Managed/support
![Page 48: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/48.jpg)
Dedicated?
•Hardware replacement
•Managed/support
•Networking
![Page 49: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/49.jpg)
Colo?
![Page 50: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/50.jpg)
Colo?
•Hardware spec/value
![Page 51: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/51.jpg)
Colo?
•Hardware spec/value
•Total cost
![Page 52: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/52.jpg)
Colo?
•Hardware spec/value
•Total cost
•Internal skills?
![Page 53: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/53.jpg)
Colo?
•Hardware spec/value
•Total cost
•Internal skills?
•More fun?!
![Page 54: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/54.jpg)
![Page 55: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/55.jpg)
•Build master (buildbot): VM x2 CPU 2.0GHz, 2GB RAM – $89/m
•Build slave (buildbot): VM x1 CPU 2.0GHz, 1GB RAM – $40/m
•Staging load balancer: VM x1 CPU 2.0GHz, 1GB RAM – $40/m
•Staging server 1: VM x2 CPU 2.0GHz, 8GB RAM – $165/m
•Staging server 2: VM x1 CPU 2.0GHz, 2GB RAM – $50/m
•Puppet master: VM x2 CPU 2.0GHz, 2GB RAM – $89/m
Total: $473/m
Colo experiment
![Page 56: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/56.jpg)
Colo experiment
•Dell 1U R415
•x2 8C AMD 2.8GHz
•32GB RAM
![Page 57: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/57.jpg)
Colo experiment
•Dell 1U R415
•x2 8C AMD 2.8GHz
•32GB RAM
•Dual PSU, NIC
![Page 58: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/58.jpg)
Colo experiment
•Dell 1U R415
•x2 8C AMD 2.8GHz
•32GB RAM
•Dual PSU, NIC
•x4 1TB SATA hot swappable
![Page 59: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/59.jpg)
![Page 60: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/60.jpg)
Colo: Networking
•10-50Mbps: £20-25/Mbps/m
•51-100Mbps: £15/Mbps/m
•100+Mbps: £13/Mbps/m
![Page 61: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/61.jpg)
Colo: Metro
•100Mbps: £300/m
•1000Mbps: £750/m
![Page 62: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/62.jpg)
Colo: Power
•£300-350/kWh/m
•4.5A = £520/m
•9A = £900/m
![Page 63: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/63.jpg)
![Page 64: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/64.jpg)
Tips: rand()
•Field names
-Field names take up space
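To put a number on this: BSON stores every field name inside every document as a UTF-8 string plus a terminating NUL byte. The sketch below counts only that name overhead for two hypothetical documents with the same values; the field names are invented for illustration and compression is not considered.

```python
def field_name_bytes(document: dict) -> int:
    """Bytes BSON spends on top-level field names (UTF-8 name + NUL terminator)."""
    return sum(len(name.encode("utf-8")) + 1 for name in document)


long_names = {
    "server_identifier": 42,
    "cpu_utilisation_percent": 87.5,
    "timestamp": 1380585600,
}
short_names = {"s": 42, "c": 87.5, "t": 1380585600}

saved_per_doc = field_name_bytes(long_names) - field_name_bytes(short_names)
print(f"{saved_per_doc} bytes saved per document")
print(f"{saved_per_doc * 1_000_000_000 / 1024**3:.1f} GiB saved per billion documents")
```

The saving per document is small, but it is paid again on every document, in RAM as well as on disk.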
![Page 65: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/65.jpg)
Tips: rand()
•Field names
•Covered indexes
- Get everything from the index
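A simplified way to reason about whether a query can be answered from the index alone: every field in the filter and every returned field must be in the index, and `_id` must be excluded unless it is indexed too. The helper `is_covered` below is a hypothetical sketch of that rule only; it ignores details such as multikey indexes and embedded documents, so treat `explain()` output as the real answer.

```python
def is_covered(index_fields, query_fields, projection):
    """Simplified check: can the index alone answer this query?

    index_fields: field names in the index
    query_fields: field names used in the query filter
    projection:   dict such as {"host": 1, "_id": 0}
    """
    indexed = set(index_fields)
    returned = {field for field, include in projection.items() if include}
    if not returned:
        # No explicit inclusions means whole documents are returned.
        return False
    # _id is returned by default unless it is excluded or part of the index.
    if projection.get("_id", 1) and "_id" not in indexed:
        return False
    return set(query_fields) <= indexed and returned <= indexed


print(is_covered(["host", "ts"], ["host"], {"host": 1, "ts": 1, "_id": 0}))
print(is_covered(["host", "ts"], ["host"], {"host": 1}))
```

The second call is not covered because `_id` is returned implicitly and is not part of the index.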
![Page 66: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/66.jpg)
Tips: rand()
•Field names
•Covered indexes
•Collections / databases
- Dropping collections faster than remove()
- Split use cases across databases to avoid locking
- Put databases onto different disks / types e.g. SSDs
![Page 67: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/67.jpg)
Backups
What is the use case?
![Page 68: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/68.jpg)
Backups
What is the use case?
Fixing user errors?
Point in time restore?
Disaster recovery?
![Page 69: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/69.jpg)
Backups
•Disaster recovery
Offsite
- What kind of disaster?
- Store backups offsite
![Page 70: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/70.jpg)
Backups
Age
Offsite
•Disaster recovery
- How long do you keep the backups for?
- How far do they go back?
- How recent are they?
![Page 71: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/71.jpg)
Backups
Age
Offsite
Restore time
•Disaster recovery
- Latency issue: the further away geographically, the slower the transfer
- Partition backups to get critical data restored first
![Page 72: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/72.jpg)
david@asriel ~: scp david@stelmaria:~/local/local.11 .
local.11 100% 2047MB 6.8MB/s 05:01
Restore time
- Needed to resync a database server across the US
- Took too long; the oplog was not large enough
- Fast internal network but slow internet
![Page 73: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/73.jpg)
1d, 1h, 58m
11.22MB/s
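The restore-time figures can be sanity-checked with simple arithmetic. The sketch below reproduces the 05:01 from the scp output (2047MB at 6.8MB/s) and, assuming the 1d 1h 58m figure is the elapsed time of a transfer averaging 11.22MB/s, estimates how much data that transfer moved. That pairing is an interpretation of the slide, not something it states.

```python
def transfer_seconds(size_mb: float, rate_mb_per_s: float) -> float:
    """Time needed to move size_mb at a steady rate."""
    return size_mb / rate_mb_per_s


def fmt(seconds: float) -> str:
    """Format a duration as days, hours, minutes and seconds."""
    s = int(round(seconds))
    days, s = divmod(s, 86400)
    hours, s = divmod(s, 3600)
    minutes, s = divmod(s, 60)
    return f"{days}d {hours}h {minutes}m {s}s"


# The scp transfer: 2047MB at 6.8MB/s
print(fmt(transfer_seconds(2047, 6.8)))

# Assumed pairing: 1d 1h 58m elapsed at an average of 11.22MB/s
elapsed_s = 1 * 86400 + 1 * 3600 + 58 * 60
moved_tb = elapsed_s * 11.22 / 1024 / 1024
print(f"{moved_tb:.2f} TB moved")
```

At these rates a dataset of about a terabyte takes more than a day to move, which is why the oplog has to cover at least that window for a resync to succeed.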
![Page 74: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/74.jpg)
Backups
Frequency
Consistency
Verification
- How often?
- Backing up cluster at the same time - data moving around
- Can the backups be restored?
![Page 75: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/75.jpg)
www.flickr.com/photos/daddo83/3406962115/
Monitoring
•System
Disk i/o
Disk use
- Disk i/o % util
- Disk space usage
![Page 76: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/76.jpg)
david@pan ~: df -a
Filesystem     1K-blocks      Used Available Use% Mounted on
/dev/sda1      156882796 148489776    423964 100% /
proc                   0         0         0    - /proc
none                   0         0         0    - /dev/pts
none             2097260         0   2097260   0% /dev/shm
none                   0         0         0    - /proc/sys/fs/binfmt_misc
david@pan ~: df -ah
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       150G  142G  415M 100% /
proc               0     0     0    - /proc
none               0     0     0    - /dev/pts
none            2.1G     0  2.1G   0% /dev/shm
none               0     0     0    - /proc/sys/fs/binfmt_
- Needed to upgrade a machine
- Resize = downtime
- Resyncing finished just in time
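A small sketch of the disk space check this incident argues for. `use_percent` mirrors how GNU df derives Use% (used divided by used plus available, rounded up), which is why the output above shows 100% with 415M still free. The gap between 1K-blocks and Used + Available is most likely space reserved for root. `disk_alert` and its 80% threshold are assumptions for illustration.

```python
import math
import shutil


def use_percent(used: int, available: int) -> int:
    """Use% the way df reports it: used / (used + available), rounded up."""
    return math.ceil(100 * used / (used + available))


def disk_alert(path: str = "/", threshold_percent: int = 80) -> bool:
    """True when the filesystem holding `path` is at or above the threshold."""
    usage = shutil.disk_usage(path)
    return use_percent(usage.used, usage.free) >= threshold_percent


# Figures for /dev/sda1 from the df -a output above (1K blocks)
print(use_percent(148489776, 423964))
```

Alerting well before 100% leaves time to resync or resize without racing the disk.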
![Page 77: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/77.jpg)
www.flickr.com/photos/daddo83/3406962115/
Monitoring
Disk i/o
Disk use
•System
Swap
- Disk i/o % util
- Disk space usage
![Page 78: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/78.jpg)
www.flickr.com/photos/daddo83/3406962115/
Monitoring
Slave lag
State
•Replication
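Slave lag can be derived by comparing each secondary's last applied operation time with the primary's. The sketch below works on a list shaped like the `members` array returned by MongoDB's `replSetGetStatus` command (`name`, `stateStr`, `optimeDate`). The snapshot data, host names and the 60 second alert threshold are made up for illustration.

```python
from datetime import datetime, timedelta


def replication_lag(members):
    """Map each secondary's name to how far its optime trails the primary's."""
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: primary["optimeDate"] - m["optimeDate"]
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Hypothetical status snapshot
now = datetime(2013, 10, 1, 12, 0, 0)
members = [
    {"name": "db1:27017", "stateStr": "PRIMARY", "optimeDate": now},
    {"name": "db2:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(seconds=2)},
    {"name": "db3:27017", "stateStr": "SECONDARY", "optimeDate": now - timedelta(minutes=5)},
]

lag = replication_lag(members)
alerts = [name for name, delay in lag.items() if delay > timedelta(seconds=60)]
print(alerts)
```

Monitoring `stateStr` alongside lag also catches members stuck in states other than PRIMARY or SECONDARY.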
![Page 79: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/79.jpg)
Monitoring tools
Run yourself
Ganglia
So Server Density is the tool my company produces but if you don’t like it, want to run your own tools locally or just want to try some others, then that’s fine.
![Page 80: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/80.jpg)
Monitoring tools
www.serverdensity.com
![Page 81: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/81.jpg)
On-call
Dealing with humans
- Sharing out the responsibility
- Determining level of response: 24/7 real monitoring or first responder
- 24/7 real monitoring for HA environments, real people at a screen at all times
- First responder: people at the end of a phone
![Page 82: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/82.jpg)
On-call 1) Ops engineer
Dealing with humans
- During working hours our dedicated ops engineers take the first level
- Avoids interrupting product engineers for initial fire fighting
![Page 83: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/83.jpg)
On-call 1) Ops engineer
2) All engineers
Dealing with humans
- Out of hours we rotate every engineer, product and ops
- Rotation every 7 days on a Tuesday
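A weekly rotation that changes over on Tuesdays is easy to compute from a fixed starting Tuesday. In the sketch below the engineer names and the start date are hypothetical; only the 7-day Tuesday handover comes from the slide.

```python
from datetime import date

ENGINEERS = ["alice", "bob", "carol"]  # hypothetical rota, product and ops mixed
ROTATION_START = date(2013, 10, 1)     # an arbitrary Tuesday chosen as week zero

# weekday() counts Monday as 0, so Tuesday is 1
assert ROTATION_START.weekday() == 1


def on_call(day: date) -> str:
    """Return who is on call out of hours on the given day."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return ENGINEERS[weeks_elapsed % len(ENGINEERS)]


print(on_call(date(2013, 10, 7)))  # Monday, still the first week
print(on_call(date(2013, 10, 8)))  # Tuesday, handover to the next engineer
```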
![Page 84: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/84.jpg)
On-call 1) Ops engineer
2) All engineers
3) Ops engineer
Dealing with humans
- Always have a secondary
- This is always an ops engineer
- The thinking is that if the issue needs to be escalated, it’s likely a bigger problem that needs additional systems expertise
![Page 85: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/85.jpg)
On-call 1) Ops engineer
2) All engineers
3) Ops engineer
4) Others
Dealing with humans
- Support from design / frontend engineering
- Have to press a button to get them involved
![Page 86: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/86.jpg)
Off-call
Dealing with humans
- Responders to an incident get next 24 hours off-call
- Social issues to deal with
![Page 87: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/87.jpg)
On-call CEO
Dealing with humans
- I receive push notifications + e-mails for all outages
![Page 88: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/88.jpg)
Uptime reporting
Dealing with humans
- Weekly internal report on G+
- Gives visibility to entire company about any incidents
- Allows us to discuss incidents to get to that 100% uptime
![Page 89: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/89.jpg)
Social issues
Dealing with humans
- How quickly can you get to a computer?
- Are they out drinking on a Friday?
- What happens if someone is ill?
- What if there’s a sudden emergency: accident? family emergency?
- Do they have enough phone battery?
- Can you hear the ringtone?
![Page 90: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/90.jpg)
Backup responder
Dealing with humans
- Backup responder
- Time out the initial responder
- Escalate difficult problems
- Essentially human redundancy: phone provider, geographic area, internet connectivity
![Page 91: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/91.jpg)
Expected
Dealing with outages
- Outages are going to happen, especially at the beginning
- Costs money for redundancy
- How you deal with them
![Page 92: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/92.jpg)
Dealing with outages
Externally
Communication
- Telling people what is happening
- Frequently
- Dependent on audience: we can go into more detail because our customers are techies
- GitHub do a good job of providing incident writeups but don’t provide a good idea of what is happening right now
- Generally Amazon and Heroku are good and go into more detail
![Page 93: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/93.jpg)
Communication
Dealing with outages
Internally
- Open Skype conferences between the responders
- Usually mostly silence or the sound of the keyboard, but simulates being in the situation room
- Faster than typing
![Page 94: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/94.jpg)
Really test your vendors
Dealing with outages
- Shows up flaws in vendor support processes
- Frustrating when waiting on someone else
- You want as much information as possible
- Major outage? Everyone will be calling them
![Page 95: High performance Infrastructure Oct 2013](https://reader037.fdocuments.net/reader037/viewer/2022110302/548dede6b4795995708b4574/html5/thumbnails/95.jpg)
Simulations
Dealing with outages
- Try and avoid unnecessary problems
- Do servers come back up from boot?
- Can hot spares handle the load?
- Test failover: databases, HA firewalls
- Regularly reboot servers
- Wargames can happen at another stage: startups are usually too focused on building things first