From No Git to 3000 GitHub Users and How to Keep Them Happy - GitHub Universe 2015

Dan Cundiff - @pmotch, Target Corporation


A story about leveling up Target

2013-01-24 17:36:48 UTC

Started on a server under a desk

More teams wanted to use it

Seek protection, get allies, defuse and convert the objectors

Graduate to proper infra

Sell the benefits

Familiar to developers

Social coding

Code sharing and discovery

Integration to anything

Making users happy

simple URL: git.target.com

just 1 instance

Single go-to wiki page

Things to put in a wiki page:

● How to get access
● Profile setup
● GitHub pages, emoji, issues
● How to integrate with common tools
● Link to post-mortems, how to reach admins
● Link to GitHub docs for the rest

Stack wiki page too!

Document all admin things

Simple monitoring: HTTP synthetic transaction every 5 minutes + PagerDuty
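A minimal sketch of what such a synthetic check could look like, assuming a plain HTTP GET against the instance and an alert sent through the PagerDuty Events API v2. The post-mortem later in this talk mentions Runscope as the actual tool, so this is a stand-in, not the setup Target used. The function names, the `routing_key` value, and the "status 200-399 means healthy" rule are all assumptions made for illustration.

```python
import json
import urllib.error
import urllib.request
from typing import Callable

# PagerDuty Events API v2 endpoint for triggering alerts.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, ...


def build_alert(routing_key: str, url: str, status: int) -> dict:
    """Build an Events API v2 'trigger' event describing the failed check."""
    return {
        "routing_key": routing_key,  # hypothetical integration key
        "event_action": "trigger",
        "payload": {
            "summary": f"Synthetic check failed for {url} (status {status})",
            "source": url,
            "severity": "critical",
        },
    }


def send_alert(event: dict) -> None:
    """POST the event to PagerDuty."""
    req = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()


def run_check(
    url: str,
    routing_key: str,
    probe: Callable[[str], int] = http_status,
    notify: Callable[[dict], None] = send_alert,
) -> bool:
    """Probe the URL once; page the admins if it does not look healthy."""
    status = probe(url)
    healthy = 200 <= status < 400  # assumed definition of "healthy"
    if not healthy:
        notify(build_alert(routing_key, url, status))
    return healthy
```

Scheduling `run_check` every five minutes (for example from cron) would approximate the cadence on the slide. Running it from outside the network segment that hosts the instance is a design choice worth considering, so that a failed check also covers network problems.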

Do the active / passive setup (practice failover)

Offline backups (practice restores)
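One small piece of this that lends itself to automation is checking that backups are still being produced. The sketch below assumes a hypothetical layout in which each backup run creates one subdirectory under `backup_dir`; that layout, the function names, and the 24-hour threshold are illustrative assumptions, not details from the talk. It does not replace practicing a real restore.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age_hours(backup_dir: str, now: Optional[float] = None) -> Optional[float]:
    """Return the age in hours of the most recently modified snapshot directory.

    Returns None when backup_dir contains no subdirectories at all.
    """
    now = time.time() if now is None else now
    snapshots = [p for p in Path(backup_dir).iterdir() if p.is_dir()]
    if not snapshots:
        return None
    newest_mtime = max(p.stat().st_mtime for p in snapshots)
    return (now - newest_mtime) / 3600.0


def backup_is_fresh(backup_dir: str, max_age_hours: float = 24.0,
                    now: Optional[float] = None) -> bool:
    """True if at least one snapshot exists and the newest is recent enough."""
    age = newest_backup_age_hours(backup_dir, now)
    return age is not None and age <= max_age_hours
```

A check like this only shows that something was written recently. Whether the backup is usable is still something only a practiced restore can confirm.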

Upgrade <24 hours*

Security upgrade < 1 hr (asap)

3 part-time admins

Automate admin where you can

Plain speaking and friendly communication

Quick email help

Better: chat help

Even better: community help

Emergency: “We understand emergencies happen from time to time, so we've made available an email address you can write which will page us: ***. Literally, if you email this address, it will interrupt our dinner with our families or wake us up at 2:00am with SMS, push notices, ringing phones, beeps and all kinds of things which immediately grab our attention.”

GitHub support is 1st class, use it

Ask for user feedback

Be a good GHE admin and let GitHub know about issues

Post-mortems

Hey GHE users - here’s the post-mortem as promised. It’s with a tear we write this as we almost reached a full year of zero unplanned downtime resulting from a GHE defect or an action we took. But it was an action we took a few weeks ago that brought it down last night.

● On 2015-04-30, we upgraded to GHE v2.2.0. It was the first time GitHub strongly recommended taking an ESXi VM snapshot before beginning the upgrade (we normally don’t because our rebuild procedure is well practiced.) We followed the recommendation.
● On 2015-05-21 at 9:35 PM CDT a Runscope synthetic transaction failed which triggered PagerDuty to call both GHE admins.
● git1 VM was hard down, but git2 was up, so we knew it wasn’t a network issue, but we didn’t want to failover given the state of the VM (if we did, we might just cause the same issue on the hot standby.)
● By 10:15 PM we determined the culprit was a snapshot that had exceeded the available disk space on the same volume where GHE runs.
● We deleted the snapshot; the subsequent consolidation process took about an hour.
● At 11:46 PM we brought GHE back online and it was working fine.
● We verified HA replication was functioning correctly AND that backups were running as normal. No data was lost (the thing we care about most).

From now on, we’ll place the snapshot on a separate volume and/or remove the snapshot as soon as we determine it’s no longer needed.

While what happened is an embarrassing n00b mistake, we think it’s still important to talk about it and learn from it.

As always, let us know if you have any feedback for us. We always want to make this thing better.
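The failure described in this post-mortem (a snapshot filling the volume) is the kind of condition a simple free-space check can flag ahead of time. The sketch below is an assumption-laden illustration, not something from the talk: it uses Python's `shutil.disk_usage`, a hypothetical 20% free-space threshold, and assumes the check runs on a host that can see the volume in question (for an ESXi datastore that would not be the GHE guest itself).

```python
import shutil
from typing import NamedTuple


class VolumeStatus(NamedTuple):
    path: str
    free_fraction: float  # 0.0 (full) to 1.0 (empty)
    ok: bool


def evaluate(path: str, total_bytes: int, free_bytes: int,
             min_free_fraction: float = 0.20) -> VolumeStatus:
    """Decide whether a volume has enough headroom left."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    free_fraction = free_bytes / total_bytes
    return VolumeStatus(path, free_fraction, free_fraction >= min_free_fraction)


def check_volume(path: str, min_free_fraction: float = 0.20) -> VolumeStatus:
    """Read real usage numbers for the volume containing `path` and evaluate them."""
    usage = shutil.disk_usage(path)
    return evaluate(path, usage.total, usage.free, min_free_fraction)


if __name__ == "__main__":
    status = check_volume("/")  # hypothetical path; point it at the snapshot volume
    if not status.ok:
        print(f"WARNING: {status.path} has only {status.free_fraction:.0%} free")
```

Feeding a failed result into the same paging path used for the synthetic transaction would turn a late-night outage into a daytime warning, under the assumption that the volume fills gradually, as a growing snapshot does.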

“You guys rock. I can only imagine a world where every system we all use day-to-day had this much visibility about mistakes and, more importantly, how they will prevent them going forward. I’m going to challenge myself to start doing retrospectives when my stuff fails.”

“...if you have to have a n00b mistake, the night before a 3 day weekend is the BEST time! On a serious note, keep up the great work! LOVE my git!”

“So sad. Thanks for your efforts and the great up time you give us!”

“The honesty is absolutely wonderful, thank you for not withholding bad news. Y’all look more professional as a result.”

“great !== perfect. You guys are great.”

“This is awesome communication.”

We’re hiring!

Dan Cundiff - Target

@pmotch