Operations at Twitter - O'Reilly Media (assets.en.oreilly.com/1/event/44/In the Belly of the...)
![Page 1: Operations at Twitter - O'Reilly Mediaassets.en.oreilly.com/1/event/44/In the Belly of the Whale... · 700M Searches/Day source: twitter.com internal, includes api based searches](https://reader031.fdocuments.net/reader031/viewer/2022031315/5c4467c993f3c34c643d6c89/html5/thumbnails/1.jpg)
Operations at Twitter
John Adams
Twitter Operations
John Adams / @netik
• Early Twitter employee
• Lead engineer: Application Services (Apache, Unicorn, SMTP, etc.)
• Keynote Speaker: O’Reilly Velocity 2009
• O’Reilly Web 2.0 Speaker (2008, 2010)
• Previous companies: Inktomi, Apple, c|net
What changed since Velocity ’09?
• Specialized services for social graph storage
• More efficient use of Apache
• Unicorn (Rails)
• More servers, more LBs, more humans
• Memcached partitioning - dedicated pools+hosts
• More process, more science.
210 employees
Sharding humans is difficult.
API: 75%
Web: 25%
160K Registered Apps
source: twitter.com internal
700M Searches/Day
source: twitter.com internal, includes api based searches
65M Tweets per day (~750 Tweets/sec)
source: twitter.com internal
2,940 TPS: Japan Scores!
3,085 TPS: Lakers Win!
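As a quick check on the figures above, the average rate is the daily total divided by the number of seconds in a day, and the event peaks can be expressed as a multiple of that average. A small Python sketch (the variable names are illustrative, not from the talk):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

tweets_per_day = 65_000_000  # daily total quoted on the slide
average_tps = tweets_per_day / SECONDS_PER_DAY

# Peak tweets-per-second figures quoted on the slide.
peaks = {"Japan Scores!": 2940, "Lakers Win!": 3085}

print(f"average: {average_tps:.0f} tweets/sec")
for event, tps in peaks.items():
    # How far above the daily average each burst was.
    print(f"{event} {tps} TPS is {tps / average_tps:.1f}x the average")
```

The point for capacity planning is that bursts run at roughly four times the daily average, so headroom has to be sized for peaks, not means.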
Operations
• Support the site and the developers
• Make it performant
• Capacity Planning (metrics-driven)
• Configuration Management
• Improve existing architecture and plan for future
Nothing works the first time.
• Scale site using best available technologies
• Plan to build everything more than once.
• Most solutions work to a certain level of scale, and then you must re-evaluate to grow.
• We’re doing this now.
MTTD
MTTR
Operations Mantra
• Find Weakest Point (Metrics + Logs + Science = Analysis)
• Take Corrective Action (Process)
• Move to Next Weakest Point (Repeatability)
Monitoring
• Twitter graphs and reports critical metrics in as near to real time as possible
• If you build tools against our API, you should too.
• Use this data to inform the public
• dev.twitter.com - API availability
• status.twitter.com
Sysadmin 2.0
• Don’t be a “systems administrator” anymore.
• Combine statistical analysis and monitoring to produce meaningful results
• Make decisions based on data
Profiling
• Low-level
• Identify bottlenecks inside of core tools
• Latency, Network Usage, Memory leaks
• Methods
• Network services: tcpdump + tcpdstat, yconalyzer
• Introspect with Google perftools
Data Analysis
• Instrumenting the world pays off.
• “Data analysis, visualization, and other techniques for seeing patterns in data are going to be an increasingly valuable skill set. Employers take notice!” (“Web Squared: Web 2.0 Five Years On”, Tim O’Reilly, Web 2.0 Summit, 2009)
Rails
• Front-end (Scala/Java back-end)
• Not to blame for our issues. Analysis found:
• Caching + Cache invalidation problems
• Bad queries generated by ActiveRecord, resulting in slow queries against the db
• Garbage Collection issues (20-25%)
• Replication Lag
Analyze
• Turn data into information
• Where is the code base going?
• Are things worse than they were?
• Understand the impact of the last software deploy
• Run check scripts during and after deploys
• Capacity Planning, not Fire Fighting!
Logging
• Syslog doesn’t work at high traffic rates
• No redundancy, no ability to recover from daemon failure
• Moving large files around is painful
• Solution:
• Scribe to HDFS with LZO Compression
Dashboard
• “Criticals” view
• Smokeping/MRTG
• Google Analytics
• Not just for HTTP 200s/SEO
• XML Feeds from managed services
Whale Watcher
• Simple shell script, Huge Win
• Whale = HTTP 503 (timeout)
• Robot = HTTP 500 (error)
• Examines last 60 seconds of aggregated daemon / www logs
• “Whales per Second” > W_threshold
• Thar be whales! Call in ops.
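The check described above can be sketched roughly as follows. The original is a shell script that is not shown in the slides, so this is a Python stand-in under stated assumptions: the log entries are assumed to be already parsed into (timestamp, status) pairs, and the function names and threshold value are hypothetical.

```python
WHALE_STATUS = 503  # "whale" page (timeout), per the slide
ROBOT_STATUS = 500  # "robot" page (error), per the slide


def errors_per_second(entries, now, window=60, status=WHALE_STATUS):
    """Rate of responses with `status` over the last `window` seconds.

    `entries` is an iterable of (unix_timestamp, http_status) tuples,
    assumed to come from the aggregated daemon / www logs.
    """
    recent = sum(
        1 for ts, code in entries
        if now - window < ts <= now and code == status
    )
    return recent / window


def should_call_ops(entries, now, threshold):
    """True when whales per second exceed the (hypothetical) threshold."""
    return errors_per_second(entries, now) > threshold


# Example: 120 whales spread over the last minute, plus unrelated noise.
now = 1000.0
entries = [(now - i * 0.5, 503) for i in range(120)]
entries += [(now - 5, 500), (now - 100, 503), (now - 1, 200)]

if should_call_ops(entries, now, threshold=1.0):
    print("Thar be whales! Call in ops.")
```

Passing `status=ROBOT_STATUS` gives the same rate for robots, so one function covers both error pages.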
Change Management
• Reviews in Reviewboard
• Puppet + SVN
• Hundreds of modules
• Runs constantly
• Reuses tools that engineers use
Deploy Watcher
Sample window: 300.0 seconds
First start time: Mon Apr 5 15:30:00 2010 (Mon Apr 5 08:30:00 PDT 2010)
Second start time: Tue Apr 6 02:09:40 2010 (Mon Apr 5 19:09:40 PDT 2010)
PRODUCTION APACHE: ALL OK
PRODUCTION OTHER: ALL OK
WEB049 CANARY APACHE: ALL OK
WEB049 CANARY BACKEND SERVICES: ALL OK
DAEMON031 CANARY BACKEND SERVICES: ALL OK
DAEMON031 CANARY OTHER: ALL OK
Deploys
• Block deploys if site in error state
• Graph time-of-deploy alongside server CPU and Latency
• Display time-of-last-deploy on dashboard
• Communicate deploys in Campfire to teams
^^ last deploy times ^^
Feature “Darkmode”
• Specific site controls to enable and disable computationally or IO-heavy site functions
• The “Emergency Stop” button
• Changes logged and reported to all teams
• Around 90 switches we can throw
• Static / Read-only mode
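A minimal sketch of what a switch registry with change logging might look like. The slides do not show Twitter's implementation, so the class, method names, and feature names below are all hypothetical; only the ideas (named switches, an emergency stop, and logged changes) come from the slide.

```python
import logging

log = logging.getLogger("darkmode")


class DarkmodeSwitches:
    """Hypothetical registry of on/off switches for expensive features."""

    def __init__(self, names):
        self._enabled = {name: True for name in names}
        self.changes = []  # audit trail of (switch, new_state, who)

    def is_enabled(self, name):
        return self._enabled[name]

    def set(self, name, enabled, who):
        if name not in self._enabled:
            raise KeyError(f"unknown switch: {name}")
        self._enabled[name] = enabled
        self.changes.append((name, enabled, who))
        # Every change is logged so it can be reported to all teams.
        log.warning("%s turned %s %s", who, name, "on" if enabled else "off")

    def emergency_stop(self, who):
        """Turn every switch off at once."""
        for name in list(self._enabled):
            self.set(name, False, who)


# Feature names are made up for illustration.
switches = DarkmodeSwitches(["search", "trends"])
switches.set("search", False, who="ops")
```

As the memcached slide later notes, flags like these should not live only in a cache that can evict them.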
subsystems
loony
• Central machine database (MySQL)
• Python, Django, Paramiko SSH
• Paramiko - Twitter’s OSS SSH Library
• Ties into LDAP
• When data center sends us email, machine definitions built in real-time
• On demand changes with run
Murder
• BitTorrent-based replication for deploys (Python w/libtorrent)
• ~30-60 seconds to update >1k machines
• Gets work list from loony
• Legal P2P
memcached
• Network Memory Bus isn’t infinite
• Evictions make the cache unreliable for important configuration data (loss of darkmode flags, for example)
• Segmented into pools for better performance
• Examine slab allocation and watch for high use/eviction rates on individual slabs using peep. Adjust slab factors and size accordingly.
request flow
Load Balancers
Apache
Rails (Unicorn)
Flock, Kestrel, Memcached
MySQL, Cassandra
Daemons, Mail Servers
Monitoring
Unicorn Rails Server
• Connection push to socket polling model
• Deploys without Downtime
• Less memory and 30% less CPU
• Shift from ProxyPass to Proxy Balancer
• Apache’s not better than nginx.
• It’s the proxy.
Asynchronous Requests
• Inbound traffic consumes a worker
• Outbound traffic consumes a worker
• The request pipeline should not be used to handle 3rd party communications or back-end work.
• Move long running work to daemons when possible.
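The idea of keeping slow work out of the request pipeline can be sketched with a queue and a background worker. This is a toy, single-process illustration using Python's standard library rather than Twitter's actual stack; `handle_request`, `daemon_worker`, and the return value are hypothetical names chosen for the example.

```python
import queue
import threading

jobs = queue.Queue()  # stands in for a real message queue
results = []          # records what the worker did


def handle_request(user, message):
    """Runs in the request pipeline: enqueue the slow work and return."""
    jobs.put((user, message))
    return "202 Accepted"  # the web worker is free again immediately


def daemon_worker():
    """Runs outside the request pipeline and drains the queue."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut down
            jobs.task_done()
            break
        user, message = job
        # Stand-in for slow third-party communication or back-end work.
        results.append(f"notified {user}: {message}")
        jobs.task_done()


worker = threading.Thread(target=daemon_worker, daemon=True)
worker.start()

status = handle_request("alice", "welcome")
jobs.join()      # wait until the queued job has been processed
jobs.put(None)   # ask the worker to stop
worker.join()
```

The request handler never waits on the third party, so a slow external service ties up a daemon rather than a web worker.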
Kestrel
• Works like memcache (same protocol)
• SET = enqueue | GET = dequeue
• No strict ordering of jobs
• No shared state between servers
• Written in Scala.
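To make the SET/GET semantics concrete, here is a toy in-memory model in Python (not Kestrel itself, which is written in Scala and speaks the memcache protocol over the network). The class names are hypothetical, and the random server selection is an assumption about how a client might spread load across servers that share no state.

```python
import random
from collections import defaultdict, deque


class ToyQueueServer:
    """One server: each key names an independent FIFO queue."""

    def __init__(self):
        self._queues = defaultdict(deque)

    def set(self, key, value):
        """SET = enqueue."""
        self._queues[key].append(value)

    def get(self, key):
        """GET = dequeue; returns None when the queue is empty."""
        q = self._queues[key]
        return q.popleft() if q else None


class ToyClient:
    """Picks servers at random; the servers never talk to each other."""

    def __init__(self, servers, rng=None):
        self.servers = servers
        self.rng = rng or random.Random()

    def enqueue(self, key, value):
        self.rng.choice(self.servers).set(key, value)

    def dequeue(self, key):
        # Try every server once, in random order.
        for server in self.rng.sample(self.servers, len(self.servers)):
            item = server.get(key)
            if item is not None:
                return item
        return None


client = ToyClient([ToyQueueServer() for _ in range(3)])
for n in range(10):
    client.enqueue("jobs", f"job-{n}")

drained = []
while (item := client.dequeue("jobs")) is not None:
    drained.append(item)
```

Each server is FIFO on its own, but because jobs land on random servers, the combined stream comes back only loosely ordered, which matches the "no strict ordering" bullet.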
Daemons
• Many different types at Twitter.
• Old way: One Daemon per type
• New Way: One Daemon, many jobs
• Daemon Slayer
• A Multi Daemon that does many different jobs, all at once.
Flock DB
• Shard the social graph through Gizzard
• Billions of edges
• MySQL backend
• Open Source (available now)
[Diagram: Flock DB on top of Gizzard, which fans out to three MySQL instances]
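The layering in the diagram, where a routing layer decides which MySQL instance holds a given user's edges, can be sketched as a range lookup. This is only an illustration in Python: the class name, shard names, and id ranges are made up, and the sketch maps raw user ids to shards directly, which is a simplifying assumption rather than a description of how Gizzard routes.

```python
import bisect


class ForwardingTable:
    """Hypothetical range-based routing from a user id to a shard."""

    def __init__(self, lower_bounds, shards):
        # lower_bounds must be sorted and start at 0; shard i owns ids
        # from lower_bounds[i] up to (not including) lower_bounds[i + 1].
        if lower_bounds != sorted(lower_bounds) or lower_bounds[0] != 0:
            raise ValueError("lower_bounds must be sorted and start at 0")
        if len(lower_bounds) != len(shards):
            raise ValueError("need exactly one shard per lower bound")
        self.lower_bounds = lower_bounds
        self.shards = shards

    def shard_for(self, source_id):
        """Return the shard that stores edges whose source is source_id."""
        index = bisect.bisect_right(self.lower_bounds, source_id) - 1
        return self.shards[index]


table = ForwardingTable([0, 1000, 2000], ["mysql-a", "mysql-b", "mysql-c"])
print(table.shard_for(1500))
```

Keeping all of one user's outgoing edges on a single shard means a "who does this user follow" query touches one database instead of all of them.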
Disk is the new Tape.
• Social Networking application profile has many O(n^y) operations.
• Page requests have to happen in < 500 ms or users start to notice. Goal: 250-300 ms
• Web 2.0 isn’t possible without lots of RAM
• What to do?
Caching
• We’re the real-time web, but lots of caching opportunity
• Most caching strategies rely on long TTLs (>60 s)
• Separate memcache pools for different data types to prevent eviction
• Optimize Ruby Gem to libmemcached + FNV Hash instead of Ruby + MD5
• Twitter largest contributor to libmemcached
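For readers unfamiliar with FNV, here is a pure-Python sketch of the 32-bit FNV-1a hash and one way a client could use it to pick a cache server. The slide does not say which FNV variant or which key distribution scheme the gem used, so both the FNV-1a choice and the simple modulo in `pick_server` are assumptions; the host names are placeholders.

```python
FNV32_OFFSET_BASIS = 0x811C9DC5
FNV32_PRIME = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime."""
    h = FNV32_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF  # keep the result in 32 bits
    return h


def pick_server(key: str, servers):
    """Map a cache key to a server by hashing it (modulo distribution)."""
    return servers[fnv1a_32(key.encode("utf-8")) % len(servers)]


servers = ["cache1:11211", "cache2:11211", "cache3:11211"]  # placeholders
print(hex(fnv1a_32(b"a")), pick_server("user:42", servers))
```

FNV needs only one XOR and one multiply per byte, which is why it is much cheaper than MD5 for picking a server on every cache operation; cryptographic strength is not needed for that job.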
Caching
• “Cache Everything!” not the best policy
• Invalidating caches at the right time is difficult.
• Cold cache problem: what happens after power or system failure?
• Use cache to augment db, not to replace
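One common way to treat the cache as an augmentation rather than a replacement is the cache-aside pattern: read from the cache, fall back to the database on a miss, and invalidate on write. The sketch below uses plain dictionaries as stand-ins for memcached and the database; the class and method names are hypothetical and nothing here is Twitter's code.

```python
class CacheAsideStore:
    """Database is the source of truth; the cache only speeds up reads."""

    def __init__(self, database):
        self.database = database  # dict standing in for the real DB
        self.cache = {}           # dict standing in for memcached
        self.db_reads = 0         # counts how often we fall through

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        # Cache miss: fall back to the database, then populate the cache.
        self.db_reads += 1
        value = self.database[key]
        self.cache[key] = value
        return value

    def put(self, key, value):
        # Write to the database first, then invalidate the cached copy.
        self.database[key] = value
        self.cache.pop(key, None)

    def flush_cache(self):
        """Simulates a cold cache after a power or system failure."""
        self.cache.clear()


store = CacheAsideStore({"user:1": "alice"})
store.get("user:1")   # miss, reads the database
store.get("user:1")   # hit, served from the cache
store.flush_cache()
store.get("user:1")   # still works: the data never lived only in cache
```

After a flush every read falls through to the database at once, which is the cold cache problem: the database has to be able to survive that load.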
MySQL Challenges
• Replication Delay
• Single threaded replication = pain.
• Social Networking not good for RDBMS
• N x N relationships and social graph / tree traversal - we have FlockDB for that
• Disk issues
• FS Choice, noatime, scheduling algorithm
Database Replication
• Major issues around users and statuses tables
• Multiple functional masters (FRP, FWP)
• Make sure your code reads and writes to the right DBs. Reading from master = slow death
• Monitor the DB. Find slow / poorly designed queries
• Kill long running queries before they kill you (mkill)
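The slides do not show how mkill works, so the following is only a sketch of the general idea in Python: scan rows shaped like MySQL's `SHOW PROCESSLIST` output and produce `KILL` statements for queries that have run too long. The function name, the sample rows, and the 30-second limit are all hypothetical.

```python
def kill_statements(processlist, max_seconds):
    """Return KILL statements for queries running longer than max_seconds.

    `processlist` is a list of dicts using SHOW PROCESSLIST column names
    (Id, Command, Time, Info). Only rows whose Command is "Query" are
    considered, so sleeping connections and replication threads are
    left alone.
    """
    statements = []
    for row in processlist:
        if row["Command"] == "Query" and row["Time"] > max_seconds:
            statements.append(f"KILL {row['Id']}")
    return statements


# Made-up sample rows for illustration.
sample = [
    {"Id": 11, "Command": "Query", "Time": 2, "Info": "SELECT 1"},
    {"Id": 12, "Command": "Query", "Time": 95, "Info": "SELECT * FROM statuses"},
    {"Id": 13, "Command": "Sleep", "Time": 4000, "Info": None},
    {"Id": 14, "Command": "Binlog Dump", "Time": 86400, "Info": None},
]

for statement in kill_statements(sample, max_seconds=30):
    print(statement)
```

In practice the statements would be sent back to the server over an existing connection; the sketch stops at generating them so that it stays runnable without a database.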
In closing...
• Use configuration management, no matter your size
• Make sure you have logs of everything
• Plan to build everything more than once
• Instrument everything and use science.
• Do it again.
Thanks!
• We support and use Open Source
• http://twitter.com/about/opensource
• Work at scale - We’re hiring.
• @jointheflock