Mo' Metrics, Mo' Problems
Transcript of Mo' Metrics, Mo' Problems
MO’ METRICS, MO’ PROBLEMS
Erin Willingham
Infrastructure Engineer at Krux Digital
Twitter: GreenSilex
https://www.linkedin.com/in/erin-willingham-104082126
GRAPHITE: THEN & NOW
What works, what doesn't, and why we did what we did
http://www.lowcountryafricana.com/wp-content/uploads/2015/10/Research-Plan-Chalkboard-Slate-1000px.jpg
GRAPHS
http://i.stack.imgur.com/WBsLg.png
<metric path> <metric value> <metric timestamp>
test.bash.stats.count_ps 50 1473048113
test/bash/stats/count_ps.wsp
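The three lines above show the plaintext metric format, a sample metric, and the Whisper file that metric lands in. A minimal Python sketch of that mapping follows; the function names are hypothetical and chosen for illustration, and the path is shown relative to the Whisper storage directory, as on the slide.

```python
def format_metric(path, value, timestamp):
    """Build one plaintext line: <metric path> <metric value> <metric timestamp>."""
    return f"{path} {value} {timestamp}"


def parse_metric(line):
    """Split a plaintext line back into (path, value, timestamp)."""
    path, value, timestamp = line.split()
    return path, float(value), int(timestamp)


def whisper_path(metric_path):
    """Map a dotted metric path to its relative .wsp file path.

    Each dot-separated segment becomes a directory; the last one is the file.
    """
    return metric_path.replace(".", "/") + ".wsp"


line = format_metric("test.bash.stats.count_ps", 50, 1473048113)
print(line)
print(whisper_path(parse_metric(line)[0]))
```

Because every distinct metric path becomes its own file, the number of metric paths drives the number of files the storage layer has to update.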
statsd & collectd
relay
aggregator
graphite whisper
GRAPHITE 1.0 ARCHITECTURE
RULES, MERGING, EFFICIENCY & OPERATIONS
https://s-media-cache-ak0.pinimg.com/236x/21/ba/0f/21ba0fe48349a1d5382c261ac25cb6c6.jpg
Graphite
v1
Relays are aware of aggregation rules
Graphite Whisper merges metrics!
Graphite Aggregators are really efficient.
THREADING, SCALING, RELAY CPU, & STORAGE
http://i.dailymail.co.uk/i/pix/2012/06/30/article-2166781-13BCE32D000005DC-492_634x948.jpg
Graphite
v1
Python - single threaded
Relay is CPU intensive
Graphite Whisper - requires sharding and is very I/O intensive
http://obfuscurity.com/
Slow UI when using distributed remote backends
What are we trying to solve? What is forcing the change?
http://oakdome.com/k5/lesson-plans/photo-editing/wanted-poster/wanted-reward-poster-background.jpg
Storage!
Relay & Aggregator CPU usage high
Faster UI
KEEP COSTS LOW
http://3.bp.blogspot.com/-r9l7rltAjnM/Udq8kGlp65I/AAAAAAAAANo/VyQZN48nfMk/s1600/treasurepile.jpg
GRAPHITE ALTERNATIVES
http://3.bp.blogspot.com/-r9l7rltAjnM/Udq8kGlp65I/AAAAAAAAANo/VyQZN48nfMk/s1600/treasurepile.jpg
Circonus: All the insights you ever wanted
Hosted Graphite
Zabbix: OSS self hosted monitoring
CARBON-C-RELAY, KAFKA, SOCAT, CARBON-RELAY-NG, KAFKACAT
https://wtfbabe.files.wordpress.com/2015/06/kung-fury-23-wtf-watch-the-film-saint-pauly.jpeg
The Tools
Carbon-c-relay
https://github.com/grobian/carbon-c-relay
GRAPHITE 2.0 TOOLS
Carbon-relay-ng
https://github.com/graphite-ng/carbon-relay-ng
GRAPHITE 2.0 TOOLS
Kafka Producer: tcp-stream-kafka-producer
https://github.com/krux/tcp-stream-kafka-producer
GRAPHITE 2.0 TOOLS
kafkacat
https://github.com/edenhill/kafkacat
GRAPHITE 2.0 TOOLS
GRAPHITE 2.0 TOOLS
socat
socat "exec:/usr/bin/kafkacat -C -o end -b <kafka broker> -t <kafka topic>",pty,ctty,echo=0 tcp4-connect:localhost:<relay port>
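In the socat invocation above, kafkacat consumes from the topic and socat forwards its output to the relay's TCP port. The forwarding half can be sketched in Python as below. This is an illustration only, not the setup used in the talk; `forward_lines` is a hypothetical helper, and the host and port passed to it are placeholders.

```python
import socket


def forward_lines(lines, host, port):
    """Send each non-empty metric line to a TCP listener, newline-terminated.

    Returns the number of lines sent.
    """
    sent = 0
    with socket.create_connection((host, port)) as conn:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines rather than sending empty records
            conn.sendall(line.encode("utf-8") + b"\n")
            sent += 1
    return sent
```

In a pipeline, `lines` could be `sys.stdin` with the consumer's output piped in, and `host` and `port` would point at the relay.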
BACKEND - STORAGE
http://www.xzbackup.com/content/wp-content/uploads/2016/01/datacenter_triinti.jpg
• Whisper
• Ceres
• InfluxDB
• Cyanite
• Riak
• KairosDB
• OpenTSDB
Graphite - Whisper
InfluxDB
KairosDB
GRAPHITE 2.0 ARCHITECTURE
GRAPHITE ARCHITECTURE - SCALABLE
http://www.dinopit.com/wp-content/uploads/2012/07/dinosaur-cowboy.jpg
Why?
LOAD TESTING THE PARTS AND THE PIPELINE
https://github.com/feangulo/graphite-stresser
All the Metrics!
Metrics / min
WHAT WORKED?
http://www.xzbackup.com/content/wp-content/uploads/2016/01/datacenter_triinti.jpg
Pre-aggregated
Post Aggregated
http://www.xzbackup.com/content/wp-content/uploads/2016/01/datacenter_triinti.jpg
MIRROR PRODUCTION DATA
https://c2.staticflickr.com/6/5278/5903002116_762783602c_b.jpg
UH OH! THE GRAPHS DON’T MATCH
http://www.xzbackup.com/content/wp-content/uploads/2016/01/datacenter_triinti.jpg
Old Cluster
New Cluster
HOW DO WE FIX THIS?
http://www.startres.net/startresWP/wp-content/uploads/2013/06/3702A.jpg
TESTING CARBON-RELAY-NG
http://www.xzbackup.com/content/wp-content/uploads/2016/01/datacenter_triinti.jpg
Carbon-relay-ng uses more than 2 CPUs!
FAILURE POINT FOR CARBON-RELAY-NG
http://www.xzbackup.com/content/wp-content/uploads/2016/01/datacenter_triinti.jpg
Post Aggregated
Pre-aggregated
Carbon-relay-ng: room for improvement
• scale out aggregators horizontally
• monitor for metrics per second and scale out as needed
• pass metrics that don’t need to be aggregated directly to the backend
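Two of the improvements listed above, horizontal scale-out and bypassing the aggregators, can be sketched as a small routing function. This is a hedged illustration in Python, not the configuration of any relay named in this talk; the patterns, addresses, and function names are assumptions made up for the example.

```python
import re
import zlib

# Hypothetical patterns for metrics that have aggregation rules.
NEEDS_AGGREGATION = [
    re.compile(r"^stats\.timers\."),
    re.compile(r"\.count_ps$"),
]


def needs_aggregation(path, patterns=NEEDS_AGGREGATION):
    """True if the metric path matches any aggregation pattern."""
    return any(p.search(path) for p in patterns)


def pick_aggregator(path, aggregators):
    """Choose an aggregator by hashing the path, so the same metric
    always reaches the same aggregator as long as the list is unchanged."""
    return aggregators[zlib.crc32(path.encode("utf-8")) % len(aggregators)]


def route(path, aggregators, backend):
    """Send metrics with aggregation rules to an aggregator;
    pass everything else directly to the backend."""
    if needs_aggregation(path, NEEDS_AGGREGATION):
        return pick_aggregator(path, aggregators)
    return backend


aggregators = ["agg-1:2023", "agg-2:2023"]  # placeholder addresses
print(route("servers.web01.cpu.idle", aggregators, "backend:2003"))
print(route("test.bash.stats.count_ps", aggregators, "backend:2003"))
```

Simple modulo hashing remaps most metrics whenever the aggregator list changes; a consistent-hash ring would reduce that churn at the cost of more code.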
https://github.com/edenhill/kafkacat
SOLUTION
http://www.xzbackup.com/content/wp-content/uploads/2016/01/datacenter_triinti.jpg
QUESTIONS?