DevOps Naughties Style - How We DevOps at MP3.com in the Early 2000's
DevOps Naughties Style (how we did DevOps at MP3.com in the early 2000s)
MP3.com: We Did DevOps Before It Was Called DevOps
MP3.com – Dec 1998
Probably not the only ones, but there wasn’t a good way to know that
Few sources of information for internet startups – little sharing of information on reliable, scalable internet architecture
After Vivendi bought us we were more than just MP3.com – we hosted all of the above
Our Problem(s)
There were all sorts of other internet sites starting up, but we were one of the first with small requests resulting in (comparatively) large responses
A quickly growing community sharing .mp3 files – 4 million mp3 files, 25 million users
At our peak we quite happily kept 1.2Gbps busy with audio files – progressive downloads
Had to come up with solutions to scaling problems at a time when few, if any, had done so before
Our Solution - DevOps
OK, it wasn’t called DevOps, but:
◦ We used tools to make a workflow and worked together as an Engineering team rather than against each other, as was more typical at the time (and still is in some (many?) places)
◦ This included my Ops team being an integral part of the Engineering group, not a separate organization
Overview
Our workflow included mainly open source tools – we’ll look at some
Strict procedures and protocols
I’ll start from the Ops side
Machine Builds
Change is bad – we tried to use the same machine types as long as possible, although with the speed of growth that was sometimes only months
Kept to as few basic types as possible – our last Linux whitebox version was our MkV server (based on a Supermicro chassis)
Machine Builds II
Single unified base OS install (Red Hat – pre-RHEL, from 5 through 8) – Kickstart install
Solaris boxes where we needed a journalled filesystem (we funded ReiserFS later on) – Jumpstart install
We built 20 or 30 boxes ahead and could build 10 boxes at a time in around 10 minutes (total) – all because we knew exactly what was going on the machines and that they were the same basic install
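A Kickstart profile of that era boiled down to a short text file describing the whole install. A minimal sketch along these lines – the real MP3.com profile isn’t public, so the partitioning, package list, and %post hook here are invented:

```
# Minimal Red Hat Kickstart sketch (invented example, not the real profile)
install
lang en_US
keyboard us
network --bootproto dhcp
rootpw --iscrypted <hash-from-build-system>
clearpart --all --initlabel
part / --size 1024 --grow
part swap --size 512
reboot

%packages
@ Base
openssh-server

%post
# e.g. register the newly built host with the inventory system
```

Because every box answered the same file, builds were fully unattended and repeatable – the same property chef/puppet bootstraps give you today.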
Redundancy and Clustering
Clustering – something we had to learn for ourselves. Other people did it too, but we didn’t know that
Initially RadWare load balancers, then F5
Lots of small unrelated clusters
Every cluster did as little as possible and had as few dependencies as possible – mainly just the DB
Naming convention for hosts and clusters
Naming Convention
XXyyyyyynn
◦ XX = two-letter code for the datacenter
◦ yyyyyy = a short descriptive name for the cluster
◦ nn = a two-digit number for the cluster
Examples:
◦ sdwww03
◦ sjdb01
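A convention this mechanical can be parsed (and generated) programmatically. A hypothetical parser for the scheme above – the regex and function are mine, not an original tool:

```python
import re

# Hypothetical parser for the MP3.com-style naming scheme:
# two-letter datacenter code + cluster name + two-digit host number.
HOST_RE = re.compile(r"^(?P<dc>[a-z]{2})(?P<cluster>[a-z]+?)(?P<num>\d{2})$")

def parse_host(name):
    """Split a hostname like 'sdwww03' into (datacenter, cluster, host number)."""
    m = HOST_RE.match(name)
    if not m:
        raise ValueError(f"not a valid hostname: {name}")
    return m.group("dc"), m.group("cluster"), int(m.group("num"))
```

For example, `parse_host("sdwww03")` yields `("sd", "www", 3)` – which is also how tooling can derive the cluster’s staging box: number 00 in the same cluster.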
Monitoring Tools
Mon: an open source monitoring tool (you can still find it here: ftp://mirror.csclub.uwaterloo.ca/slackware/slackware-4.0/kernel.org/software/admin/mon/html/news.html) ~= Nagios/Sensu
RRDtool/SNMP data gathering for all servers and storage devices. The graphs were our own – I can’t find screenshots for this ~= Cacti/Graphite
Storage
We had over 130TB of storage. Sounds small today, but 4TB was a whole rack
Storage Ooops
One of our storage towers was delivered just a little carelessly – $120K of disk:
Host Management
All machines reported to MachDB, which collected data about the machine and reported back (~= ohai)
We manually entered a switch location on install
We knew exactly which rack any of the 2000 hosts on our network was in and could create web-based maps of every machine
MachDB is still around: http://www.machdb.org/
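The idea behind a self-reporting agent like this is simple enough to sketch. The following is not MachDB’s actual protocol – the fact names and the example switch location are invented – but it shows the shape: each host gathers basic facts and ships them to a central inventory:

```python
# Sketch of a host self-reporting agent (not MachDB's real protocol):
# gather basic facts about this machine and serialize them for a
# central inventory database, which can then map hosts to racks.
import json
import platform
import socket

def gather_facts(switch_location=None):
    """Collect basic host facts; the switch location was entered by hand at install."""
    return {
        "hostname": socket.gethostname(),
        "os": platform.system(),
        "os_release": platform.release(),
        "arch": platform.machine(),
        "switch_location": switch_location,  # e.g. "sd-sw04/port12" (hypothetical)
    }

def to_report(facts):
    """Serialize the facts for submission to the inventory server."""
    return json.dumps(facts, sort_keys=True)
```

The one manually entered fact (switch location) is what let the database tie an otherwise anonymous host to a physical rack.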
User Management
Centralized user management – built internally
Based on LDAP – used by the whole company, not just Engineering
/etc/passwd and /etc/group built individually for each server and distributed via SSH
Combined with local versions to ensure that if the remote file was bad we could at least get in as root
Root user passwords held only within the Ops group
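The merge step is the interesting safety property: system accounts defined locally must survive a bad central push. A minimal sketch of that idea (not the original tool – the merge policy shown, local entries winning, is an assumption):

```python
# Sketch of merging centrally generated passwd entries with a local
# file, so accounts like root always survive a bad central push.
def merge_passwd(local_lines, central_lines):
    """Combine passwd-format lines; local entries win on duplicate usernames."""
    seen = set()
    merged = []
    for line in local_lines + central_lines:
        line = line.strip()
        if not line:
            continue
        user = line.split(":", 1)[0]
        if user in seen:
            continue  # keep the first (local) definition
        seen.add(user)
        merged.append(line)
    return merged
```

So even if the LDAP-derived file shipped a broken root entry, the locally defined one is kept and Ops can still log in.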
Ticketing
Unified problem reporting across the company using an in-house web-based ticketing system – tix
Tix had many of the features of today’s tools – user assignment, escalation, etc.
Development
Divided into the standard dev, qa/staging, and production environments
Dev and staging systems built using the same Kickstart as production (can we see chef/puppet here?)
CM using CVS
◦ A whole group dealt with merge issues
Version Control
We used CVS – no svn or git
The problem was that CVS versions per file, so there is no overall state
Started using manifest files – every application had a list of versions vs. files and dependencies. Manifest files were version controlled themselves.
This evolved into a tool that built artifact bundles (tar) from manifests for use by cfengine
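The manifest-to-bundle step can be sketched roughly as follows. The original tool and manifest format are not public, so the line format here ("file revision") is invented – the point is the shape: a manifest pins per-file revisions, and the builder packs the resulting checkout into a named tar artifact:

```python
# Sketch of the manifest-to-artifact idea (invented format, not the
# original tool): a manifest maps files to CVS revisions, and the
# builder packs the checked-out files into a versioned tar bundle.
import tarfile

def read_manifest(path):
    """Parse lines of the form '<file> <revision>' into a dict."""
    entries = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            name, revision = line.split()
            entries[name] = revision
    return entries

def build_bundle(app, version, files, out_dir="."):
    """Pack the checked-out files into a deployable tar artifact."""
    bundle = f"{out_dir}/{app}-{version}.tar"
    with tarfile.open(bundle, "w") as tar:
        for path in files:
            tar.add(path)
    return bundle
```

Because the manifest itself was version controlled, one manifest revision number identified the overall state CVS couldn’t express on its own.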
QA/Staging
Every cluster had a ’00’ machine using our naming convention, e.g. sdwww00
On deployment, code would be put on this machine first and tested by QA
We knew what code to deploy from the manifest file – just use the version number from that
Testing was a mixture of manual and scripted regression tests & new-feature tests
cfengine
We ended up using cfengine to deploy code
Using manifests, artifact files from the ’00’ machines were distributed to the relevant servers
Links were changed to point to the latest code (= capistrano/chef deploy)
We also used cfengine for host package updates and OS-level config files (/etc/resolv.conf etc.)
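The link-flip deploy pattern can be sketched as follows. This is a sketch only – the actual cfengine policies aren’t public, and the paths here are invented – but it is the same mechanics capistrano later popularized: unpack each release into its own directory, then atomically repoint a "current" symlink:

```python
# Sketch of the symlink-flip deploy pattern (paths invented): unpack a
# versioned artifact bundle, then atomically swap the 'current' link.
import os
import tarfile

def deploy(bundle, releases="/usr/local/app/releases",
           current="/usr/local/app/current"):
    """Unpack a bundle like 'www-1.2.tar' into releases/www-1.2 and
    repoint the 'current' symlink at it."""
    version = os.path.basename(bundle).replace(".tar", "")
    target = os.path.join(releases, version)
    os.makedirs(target, exist_ok=True)
    with tarfile.open(bundle) as tar:
        tar.extractall(target)
    # build the new link under a temp name, then rename over the old one
    tmp = current + ".new"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, current)  # rename() is atomic on POSIX
    return target
```

The rename-based swap means a serving process never sees a half-updated tree – either the old release or the new one.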
Rough Infrastructure Diagram
Early Infrastructure
Single server, NFS-mounted disk storage
◦ Soon had problems, so added NetApps for storage
◦ Heat issues in the datacenter we were in
◦ Added squid caches & moved to a bigger, better datacenter
3am datacenter visits ended…
Rain in the server room ended
Random disk pulls on live servers also ended
Datacenter Duplication
We needed geographic redundancy
A business deal with WorldCom ended up with a second datacenter in…. San Jose – not perfect, but better
AT&T wouldn’t give us good bandwidth pricing, so the new datacenter had a pre-populated Squid cache (3TB of cache)
Cut AT&T bandwidth in half overnight when we turned it on
Better AT&T bandwidth pricing followed
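A Squid-2-era accelerator (reverse proxy) setup along those lines might have looked roughly like this. A sketch only – the hostname, paths, and per-spindle sizes are invented; only the ~3TB total cache figure comes from the slides:

```
# Sketch of a Squid 2.x accelerator config (invented example)
http_port 80
httpd_accel_host origin.mp3.com
httpd_accel_port 80
httpd_accel_uses_host_header on

# Large on-disk cache spread over many spindles (~3TB total)
cache_dir ufs /cache0 250000 16 256
cache_dir ufs /cache1 250000 16 256
# ...repeated across the remaining disks

# mp3 files are big; raise the default object-size cap
maximum_object_size 102400 KB
```

Since most requests were progressive downloads of the same popular files, a warm cache in front of the origin could serve the bulk of the traffic locally – hence the halved AT&T bandwidth.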
Company Atmosphere
Respect across disciplines
Great work environment
Fast pace
Cohesive management for Dev and Ops
Sold and Resurrected
2003: MP3.com the domain was sold to CNET and still lives on as a completely different site
Much of the infrastructure design and machines lived on as the new Napster (ex-Pressplay, now part of Rhapsody)
Original MP3.com employees are *everywhere*