DevOps Naughties Style - How We DevOps at MP3.com in the Early 2000's
DevOps Naughties Style (how we did DevOps at MP3.com in the early 2000s)
MP3.com: We Did DevOps Before It Was Called DevOps
MP3.com – Dec 1998
Probably not the only ones, but there wasn’t a good way to know that
Few sources of information for internet startups – little sharing of information on reliable, scalable internet architecture
After Vivendi bought us we were more than just MP3.com – we hosted all of the above
Our Problem(s)
There were all sorts of other internet sites starting up, but we were one of the first with small requests resulting in (comparatively) large responses
A quickly growing community sharing .mp3 files – 4 million mp3 files, 25 million users
At our peak we quite happily kept 1.2Gbps busy with audio files – progressive downloads
Had to come up with solutions to scaling problems at a time when few, if any, had done so before
Our Solution - DevOps
OK, it wasn’t called DevOps, but:
◦ We used tools to make a workflow and worked together as an Engineering team rather than against each other, as was more typical at the time (and still is in some (many?) places)
◦ This included my Ops team being an integral part of the Engineering group, not a separate organization
Overview
Our workflow included mainly open source tools – we’ll look at some
Strict procedures and protocols
I’ll start from the Ops side
Machine Builds
Change is bad – we tried to use the same machine types as long as possible, although with the speed of growth that was sometimes only months
Kept to as few basic types as possible – our last Linux whitebox version was our MkV server (based on a Supermicro chassis)
Machine Builds II
Single unified base OS install (Red Hat – pre-RHEL, from 5 through 8) – Kickstart install
Solaris boxes where we needed a journalled filesystem (we funded ReiserFS later on) – Jumpstart install
We built 20 or 30 boxes ahead and could build 10 boxes at a time in around 10 minutes (total) – all because we knew exactly what was going on the machines and that they were the same basic install
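A Kickstart profile of that era boiled down to a short text file describing the whole install. A minimal sketch along these lines – the real MP3.com profile isn’t public, so the partitioning, package list, and %post hook here are invented:

```
# Minimal Red Hat Kickstart sketch (invented example, not the real profile)
install
lang en_US
keyboard us
network --bootproto dhcp
rootpw --iscrypted <hash-from-build-system>
clearpart --all --initlabel
part / --size 1024 --grow
part swap --size 512
reboot

%packages
@ Base
openssh-server

%post
# e.g. register the newly built host with the inventory system
```

Because every box answered the same file, builds were fully unattended and repeatable – the same property chef/puppet bootstraps give you today.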
Redundancy and Clustering
Clustering – something we had to learn for ourselves. Other people did it too, but we didn’t know that
Initially RadWare load balancers, then F5
Lots of small unrelated clusters
Every cluster did as little as possible and had as few dependencies as possible – mainly just the DB
Naming convention for hosts and clusters
Naming Convention
XXyyyyyynn
◦ XX = two-letter code for the datacenter
◦ yyyyyy = a short descriptive name for the cluster
◦ nn = a two-digit number for the cluster
Examples:
◦ sdwww03
◦ sjdb01
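A convention this mechanical can be parsed (and generated) programmatically. A hypothetical parser for the scheme above – the regex and function are mine, not an original tool:

```python
import re

# Hypothetical parser for the MP3.com-style naming scheme:
# two-letter datacenter code + cluster name + two-digit host number.
HOST_RE = re.compile(r"^(?P<dc>[a-z]{2})(?P<cluster>[a-z]+?)(?P<num>\d{2})$")

def parse_host(name):
    """Split a hostname like 'sdwww03' into (datacenter, cluster, host number)."""
    m = HOST_RE.match(name)
    if not m:
        raise ValueError(f"not a valid hostname: {name}")
    return m.group("dc"), m.group("cluster"), int(m.group("num"))
```

For example, `parse_host("sdwww03")` yields `("sd", "www", 3)` – which is also how tooling can derive the cluster’s staging box: number 00 in the same cluster.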
Monitoring Tools
Mon: an open source monitoring tool (you can still find it here: ftp://mirror.csclub.uwaterloo.ca/slackware/slackware-4.0/kernel.org/software/admin/mon/html/news.html) ~= Nagios/Sensu
RRDtool/SNMP data gathering for all servers and storage devices. The graphs were our own – I can’t find screenshots for this ~= Cacti/Graphite
Storage
We had over 130TB of storage. Sounds small today, but 4TB was a whole rack
Storage Ooops
One of our storage towers was delivered just a little carelessly – $120K of disk:
Host Management
All machines reported to MachDB, which collected data about the machine and reported back (~= ohai)
We manually entered a switch location on install
We knew exactly which rack any of the 2000 hosts on our network was in and could create web-based maps of every machine
MachDB is still around: http://www.machdb.org/
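The idea behind a self-reporting agent like this is simple enough to sketch. The following is not MachDB’s actual protocol – the fact names and the example switch location are invented – but it shows the shape: each host gathers basic facts and ships them to a central inventory:

```python
# Sketch of a host self-reporting agent (not MachDB's real protocol):
# gather basic facts about this machine and serialize them for a
# central inventory database, which can then map hosts to racks.
import json
import platform
import socket

def gather_facts(switch_location=None):
    """Collect basic host facts; the switch location was entered by hand at install."""
    return {
        "hostname": socket.gethostname(),
        "os": platform.system(),
        "os_release": platform.release(),
        "arch": platform.machine(),
        "switch_location": switch_location,  # e.g. "sd-sw04/port12" (hypothetical)
    }

def to_report(facts):
    """Serialize the facts for submission to the inventory server."""
    return json.dumps(facts, sort_keys=True)
```

The one manually entered fact (switch location) is what let the database tie an otherwise anonymous host to a physical rack.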
User Management
Centralized user management – built internally
Based on LDAP – used by the whole company, not just Engineering
/etc/passwd and /etc/group built individually for each server and distributed via SSH
Combined with local versions to ensure that if the remote file was bad we could at least get in as root
Root user passwords held only within the Ops group
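The merge step is the interesting safety property: system accounts defined locally must survive a bad central push. A minimal sketch of that idea (not the original tool – the merge policy shown, local entries winning, is an assumption):

```python
# Sketch of merging centrally generated passwd entries with a local
# file, so accounts like root always survive a bad central push.
def merge_passwd(local_lines, central_lines):
    """Combine passwd-format lines; local entries win on duplicate usernames."""
    seen = set()
    merged = []
    for line in local_lines + central_lines:
        line = line.strip()
        if not line:
            continue
        user = line.split(":", 1)[0]
        if user in seen:
            continue  # keep the first (local) definition
        seen.add(user)
        merged.append(line)
    return merged
```

So even if the LDAP-derived file shipped a broken root entry, the locally defined one is kept and Ops can still log in.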
Ticketing
Unified problem reporting across the company using an in-house web-based ticketing system – tix
Tix had many of the features of today’s tools – user assignment, escalation, etc.
Development
Divided into the standard dev, qa/staging, and production environments
Dev and staging systems built using the same Kickstart as production (can we see chef/puppet here?)
CM using CVS
◦ A whole group dealt with merge issues
Version Control
We used CVS – no svn or git
The problem was that CVS versions per file, so there is no overall state
Started using manifest files – every application had a list of versions vs. files and dependencies. Manifest files were version controlled themselves.
This evolved into a tool that built artifact bundles (tar) from manifests for use by cfengine
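The manifest-to-bundle step can be sketched roughly as follows. The original tool and manifest format are not public, so the line format here ("file revision") is invented – the point is the shape: a manifest pins per-file revisions, and the builder packs the resulting checkout into a named tar artifact:

```python
# Sketch of the manifest-to-artifact idea (invented format, not the
# original tool): a manifest maps files to CVS revisions, and the
# builder packs the checked-out files into a versioned tar bundle.
import tarfile

def read_manifest(path):
    """Parse lines of the form '<file> <revision>' into a dict."""
    entries = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            name, revision = line.split()
            entries[name] = revision
    return entries

def build_bundle(app, version, files, out_dir="."):
    """Pack the checked-out files into a deployable tar artifact."""
    bundle = f"{out_dir}/{app}-{version}.tar"
    with tarfile.open(bundle, "w") as tar:
        for path in files:
            tar.add(path)
    return bundle
```

Because the manifest itself was version controlled, one manifest revision number identified the overall state CVS couldn’t express on its own.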
QA/Staging
Every cluster had a ’00’ machine using our naming convention, e.g. sdwww00
On deployment, code would be put on this machine first and tested by QA
We knew what code to deploy from the manifest file – just use the version number from that
Testing was a mixture of manual and scripted regression tests & new-feature tests
cfengine
We ended up using cfengine to deploy code
Using manifests, artifact files from the ’00’ machines were distributed to the relevant servers
Links were changed to point to the latest code (= capistrano/chef deploy)
We also used cfengine for host package updates and OS-level config files (/etc/resolv.conf etc.)
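The link-flip deploy pattern can be sketched as follows. This is a sketch only – the actual cfengine policies aren’t public, and the paths here are invented – but it is the same mechanics capistrano later popularized: unpack each release into its own directory, then atomically repoint a "current" symlink:

```python
# Sketch of the symlink-flip deploy pattern (paths invented): unpack a
# versioned artifact bundle, then atomically swap the 'current' link.
import os
import tarfile

def deploy(bundle, releases="/usr/local/app/releases",
           current="/usr/local/app/current"):
    """Unpack a bundle like 'www-1.2.tar' into releases/www-1.2 and
    repoint the 'current' symlink at it."""
    version = os.path.basename(bundle).replace(".tar", "")
    target = os.path.join(releases, version)
    os.makedirs(target, exist_ok=True)
    with tarfile.open(bundle) as tar:
        tar.extractall(target)
    # build the new link under a temp name, then rename over the old one
    tmp = current + ".new"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, current)  # rename() is atomic on POSIX
    return target
```

The rename-based swap means a serving process never sees a half-updated tree – either the old release or the new one.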
Rough Infrastructure Diagram
Early Infrastructure
Single server, NFS-mounted disk storage
◦ Soon had problems, so added NetApps for storage
◦ Heat issues in the datacenter we were in
◦ Added squid caches & moved to a bigger, better datacenter
3am datacenter visits ended…
Rain in the server room ended
Random disk pulls on live servers also ended
Datacenter Duplication
We needed geographic redundancy
A business deal with WorldCom ended up with a second datacenter in…. San Jose – not perfect, but better
AT&T wouldn’t give us good bandwidth pricing, so the new datacenter had a pre-populated Squid cache (3TB of cache)
Cut AT&T bandwidth in half overnight when we turned it on
Better AT&T bandwidth pricing followed
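A Squid-2-era accelerator (reverse proxy) setup along those lines might have looked roughly like this. A sketch only – the hostname, paths, and per-spindle sizes are invented; only the ~3TB total cache figure comes from the slides:

```
# Sketch of a Squid 2.x accelerator config (invented example)
http_port 80
httpd_accel_host origin.mp3.com
httpd_accel_port 80
httpd_accel_uses_host_header on

# Large on-disk cache spread over many spindles (~3TB total)
cache_dir ufs /cache0 250000 16 256
cache_dir ufs /cache1 250000 16 256
# ...repeated across the remaining disks

# mp3 files are big; raise the default object-size cap
maximum_object_size 102400 KB
```

Since most requests were progressive downloads of the same popular files, a warm cache in front of the origin could serve the bulk of the traffic locally – hence the halved AT&T bandwidth.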
Company Atmosphere
Respect across disciplines
Great work environment
Fast pace
Cohesive management for Dev and Ops
Sold and Resurrected
2003: MP3.com the domain was sold to CNET and still lives on as a completely different site
Much of the infrastructure design and machines lived on as the new Napster (ex-Pressplay, now part of Rhapsody)
Original MP3.com employees are *everywhere*