Metrics stack 2.0
Uploaded by dieterbe
Credit: user niteroi @ panoramio.com
vimeo.com/43800150
1 Metrics 2.0 concepts
2 Implementation
3 Advanced stuff
“Dieter” ?
Peter → Deter
Terminology sync
(1234567890, 82)
(1234567900, 123)
(1234567910, 109)
(1234567920, 77)
db15.mysql.queries_running
host=db15 mysql.queries_running
How many pagerequests/s is vimeo.com doing?
● stats.hits.vimeo_com
● stats_counts.hits.vimeo_com
stats.<host>.requesthostport.vimeo_com_443
stats.timers.dfs5.proxyserver.object.GET.200.timing.upper_90
O(X*Y*Z)
X = # apps
Y = # people
Z = # aggregators
How long does it take to retrieve an object from swift?
stats.timers.<host>.proxyserver.<swift_type>.<http_method>.<http_code>.timing.<stat>
stats.timers.<host>.objectserver.<http_method>.timing.<stat>
target=stats.timers.dfs*.object*GET*timing.mean ?
target=groupByNode(stats.timers.dfs*.proxyserver.object.GET.*.timing.mean,2,"avg")
target=stats.timers.dfs*.objectserver.GET.timing.mean
swift_type=object stat=mean timing GET avg by http_code
O((DxV)^2)
D = # dimensions
V = # values per dim
collectd.db.disk.sda1.disk_time.write
What should I name my metric?
10 100 1000
10000 100000 1000000
Metrics 2.0
Old:
● information lacking
● fields unclear & inconsistent
● cumbersome strings / trees
● forbidden characters
New:
● Self-describing
● Standardized
● all dimensions in orthogonal tagspace
● Allow some useful characters
stats.timers.dfs5.proxyserver.object.GET.200.timing.upper_90
{ "server": "dfvimeodfsproxy5", "http_method": "GET", "http_code": "200", "unit": "ms", "target_type": "gauge", "stat": "upper_90", "swift_type": "object", "plugin": "swift_proxy_server" }
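The step from the dotted name to the tag set can be sketched roughly as follows. This is only an illustration, not the plugin code from the talk: the function name is hypothetical, the field positions are read off the name pattern shown earlier in the slides, and the unit, target_type and plugin values are copied from the example above. The short host name is kept as-is, since the slides do not say how "dfs5" maps to "dfvimeodfsproxy5".

```python
def tag_swift_proxy_timer(name):
    """Turn a dotted swift proxy timer name into a tag dict, or None.

    Expected shape (from the slides):
    stats.timers.<host>.proxyserver.<swift_type>.<http_method>.<http_code>.timing.<stat>
    """
    parts = name.split(".")
    if len(parts) != 9:
        return None
    if parts[0:2] != ["stats", "timers"]:
        return None
    if parts[3] != "proxyserver" or parts[7] != "timing":
        return None
    return {
        "server": parts[2],          # short host name, left unexpanded
        "swift_type": parts[4],
        "http_method": parts[5],
        "http_code": parts[6],
        "stat": parts[8],
        # Assumed constants for this metric family, as in the example above.
        "unit": "ms",
        "target_type": "gauge",
        "plugin": "swift_proxy_server",
    }


tags = tag_swift_proxy_timer(
    "stats.timers.dfs5.proxyserver.object.GET.200.timing.upper_90"
)
```

Names that do not fit the pattern return None, so a caller could fall through to other rules.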
Main advantages:
● Immediate understanding of metric meaning (ideally)
● Minimize time to graphs, dashboards, alerting rules
github.com/vimeo/graphexplorer/wiki
SI + IEC
B Err Warn Conn Job File Req ...
MB/s Err/d Req/h ...
{
"site": "vimeo.com",
"port": 80,
"unit": "Req/s",
"direction": "in",
"service": "webapp_php",
"server": "webxx"
}
Carbontagger:
... service=foo.instance=host.target_type=gauge.type=calculation.unit=B 123 1234567890
…
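A minimal sketch of how such a line could be split into tags, value and timestamp. The function name is hypothetical and this is not carbontagger's own parser. It assumes key=value pairs joined by dots, so a tag value that itself contains a dot would need a different delimiter or escaping.

```python
def parse_tagged_line(line):
    """Split '<k=v.k=v...> <value> <timestamp>' into (tags, value, timestamp)."""
    metric_id, value, timestamp = line.split()
    tags = {}
    for pair in metric_id.split("."):
        # partition keeps everything after the first '=' as the value
        key, _, val = pair.partition("=")
        tags[key] = val
    return tags, float(value), int(timestamp)


tags, value, ts = parse_tagged_line(
    "service=foo.instance=host.target_type=gauge.type=calculation.unit=B 123 1234567890"
)
```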
Statsdaemon:
..unit=B..unit=B... → unit=B/s
..unit=ms..unit=ms.. → unit=ms stat=mean
→ unit=ms stat=upper_90
→ ...
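The idea that an aggregator updates the tags as it transforms the data might look like this. It is a sketch under assumptions, not statsdaemon's code: the function names are hypothetical, setting target_type to "rate" is my assumption, and upper_90 is computed as the largest value among the lowest 90% of samples, which is one common definition.

```python
def aggregate_counts(tags, values, interval_s):
    """Sum counts over a flush interval and emit a per-second rate."""
    # unit=B becomes unit=B/s; the other tags are carried through unchanged.
    out = dict(tags, unit=tags["unit"] + "/s", target_type="rate")
    return out, sum(values) / interval_s


def aggregate_timer(tags, values):
    """Emit one (tags, value) pair per statistic; the unit is unchanged."""
    ordered = sorted(values)
    keep = max(1, int(round(len(ordered) * 0.9)))  # size of the lowest 90%
    results = {
        "mean": sum(ordered) / len(ordered),
        "upper_90": ordered[keep - 1],
    }
    return [(dict(tags, stat=stat), value) for stat, value in results.items()]


rate_tags, rate = aggregate_counts({"unit": "B"}, [10, 20, 30], interval_s=10)
timer_out = aggregate_timer({"unit": "ms"}, list(range(1, 11)))
```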
GraphExplorer queries 101
site:api.vimeo.com unit=Req/s
requesthostport api_vimeo_com
Smoothing
avg over 10M
avg over ...
Aggregation, compare port 80 vs 443
avg by <dimension>
sum by <dimension>
sum by server
Compare port 80 traffic amongst servers
site:api.vimeo.com unit=Req/s port=80 group by none avg over 10M
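What "sum by <dimension>" and "avg by <dimension>" do to tagged series can be sketched like this. It reflects my reading of the queries above, not GraphExplorer's implementation: series that agree on every tag except the named dimension are merged into one. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def aggregate_by(series, dimension, how="sum"):
    """series: list of (tags, values) with aligned, equally long value lists."""
    groups = defaultdict(list)
    for tags, values in series:
        # Series that differ only in `dimension` share a key.
        key = tuple(sorted((k, v) for k, v in tags.items() if k != dimension))
        groups[key].append(values)
    result = []
    for key, members in groups.items():
        merged = [sum(point) for point in zip(*members)]
        if how == "avg":
            merged = [v / len(members) for v in merged]
        result.append((dict(key), merged))
    return result


series = [
    ({"site": "api.vimeo.com", "port": "80", "server": "web1"}, [1, 2]),
    ({"site": "api.vimeo.com", "port": "80", "server": "web2"}, [3, 4]),
    ({"site": "api.vimeo.com", "port": "443", "server": "web1"}, [5, 6]),
]
per_port = aggregate_by(series, "server", how="sum")
```

Here "sum by server" leaves one series per port, which is the port 80 vs 443 comparison.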
GraphExplorer queries 201
proxyserver swift server:regex upper_90 unit=ms from <datetime> to <datetime> avg over <timespec>
Compare object put/get
Stack .. http_method:(PUT|GET) swift_type=object avg by http_code,server
Comparing servers
http_method:(PUT|GET) avg by http_code,swift_type,http_method group by none
Compare http codes for GET, per swift type
http_method=GET avg by server group by swift_type
transcode unit=Job/s avg over <time> from <datetime> to <datetime>
Note: data is obfuscated
Bucketing
!queue sum by zone:apsoutheast|euwest|useast|uswest|saeast|vimeodf|vimeolv group by state
Compare job states per region (zones bucket)
group by zone
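Bucketing, as in "sum by zone:<pattern>|<pattern>|...", might work along these lines. This is a sketch that assumes substring matching, which the slides do not confirm; the function names and the zone values in the example are hypothetical.

```python
def bucket_for(value, patterns):
    """Return the first pattern contained in value, or None if none match."""
    for pattern in patterns:
        if pattern in value:
            return pattern
    return None


def sum_by_bucket(series, dimension, patterns):
    """Sum series whose `dimension` value falls into the same bucket."""
    totals = {}
    for tags, values in series:
        bucket = bucket_for(tags.get(dimension, ""), patterns)
        rest = tuple(sorted((k, v) for k, v in tags.items() if k != dimension))
        key = (bucket, rest)
        if key in totals:
            totals[key] = [a + b for a, b in zip(totals[key], values)]
        else:
            totals[key] = list(values)
    return totals


series = [
    ({"zone": "useast1", "state": "queued"}, [1, 2]),
    ({"zone": "useast2", "state": "queued"}, [3, 4]),
    ({"zone": "euwest1", "state": "queued"}, [5, 6]),
]
totals = sum_by_bucket(series, "zone", ["useast", "euwest"])
```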
Unit conversion
unit=Mb/s network dfvimeorpc sum by server
unit=MB
{
server=dfvimeodfs1
plugin=diskspace
mountpoint=_srv_node_dfs5
unit=B
type=used
target_type=gauge
}
server:dfvimeodfs unit=GB type=free srv node
unit=GB/d group by mountpoint
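Because the unit is an explicit tag, conversions like B to GB or B/s to Mb/s reduce to arithmetic on prefixes. A minimal sketch, with hypothetical helper names and a prefix table limited to a few SI and IEC entries:

```python
# SI prefixes are powers of 10, IEC prefixes are powers of 2.
PREFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def scale(value, from_prefix, to_prefix):
    """Rescale a value between prefixes of the same base unit."""
    return value * PREFIXES[from_prefix] / PREFIXES[to_prefix]


def bytes_to_bits(value):
    """B to b (or B/s to b/s)."""
    return value * 8


used_gb = scale(5 * 10**9, "", "G")                  # unit=B   -> unit=GB
used_gib = scale(2**30, "", "Gi")                    # unit=B   -> unit=GiB
net_mbit = scale(bytes_to_bits(125000), "", "M")     # unit=B/s -> unit=Mb/s
```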
Dashboard definition
queries = [
'cpu usage sum by core',
'mem unit=B !total group by type:swap',
'stack network unit=b/s',
'unit=B (free|used) group by =mountpoint'
]
stats.dfvimeocliapp2.twitter.error
{
"n1": "dfvimeocliapp2",
"n2": "twitter",
"n3": "error",
"plugin": "catchall_statsd",
"source": "statsd",
"target_type": "rate",
"unit": "unknown/s"
}
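A catch-all rule of this kind could be sketched as below. The function name is hypothetical and this is not the actual plugin; it simply numbers the nodes after the "stats" prefix and attaches the fixed tags from the example above.

```python
def catchall_statsd(name):
    """Fallback for unrecognized statsd names: number the nodes n1, n2, ..."""
    parts = name.split(".")
    if parts[0] != "stats" or len(parts) < 2:
        return None
    tags = {"n%d" % i: part for i, part in enumerate(parts[1:], start=1)}
    tags.update(
        plugin="catchall_statsd",
        source="statsd",
        target_type="rate",
        unit="unknown/s",   # nothing in the name tells us what is counted
    )
    return tags


tags = catchall_statsd("stats.dfvimeocliapp2.twitter.error")
```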
Two hard things in computer science
stats.gauges.files.id_boundary_7day
stats.gauges.files.id_boundary_ceil
unit=File id_boundary_7d
{
"unit": "File",
"n1": "id_boundary_7d"
}
{
"intrinsic": {
"site": "vimeo.com",
"unit": "Req/s"
},
"extrinsic": {
"agent": "diamond",
"processed_by": "statsd1",
"src": "index.php:135",
"replaces": "vimeo_com_reqps"
}
}
site=vimeo.com unit=Req/s \
processed_by=statsd1 \
src=index.php:135 added_by=dieter \
123 1234567890
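One way to read the intrinsic/extrinsic split: only intrinsic tags define which metric this is, so extrinsic tags can change without creating a new series. A sketch under that assumption; the function name and the sorted key=value identity format are illustrative choices, not a defined format.

```python
def metric_id(intrinsic):
    """Build a stable identity from the intrinsic tags only."""
    return " ".join("%s=%s" % (key, intrinsic[key]) for key in sorted(intrinsic))


intrinsic = {"site": "vimeo.com", "unit": "Req/s"}
extrinsic_a = {"agent": "diamond", "processed_by": "statsd1"}
extrinsic_b = {"agent": "diamond", "processed_by": "statsd2"}

# Extrinsic tags are metadata: they play no part in the identity.
point_a = (metric_id(intrinsic), extrinsic_a, 123, 1234567890)
point_b = (metric_id(intrinsic), extrinsic_b, 125, 1234567900)
same_series = point_a[0] == point_b[0]
```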
Equivalence
servers.host.cpu.total.iowait → "core": "_sum_"
servers.host.cpu.<corenumber>.iowait
servers.host.loadavg.15
Rollups & aggregation
/etc/carbon/storage-aggregation.conf
[min]
pattern = \.min$
aggregationMethod = min
[max]
pattern = \.max$
aggregationMethod = max
[sum]
pattern = \.count$
aggregationMethod = sum
[default_average]
pattern = .*
aggregationMethod = average
2 kinds of graphite users
Self-describing metrics
stat=upper/lower/mean/...
target_type=counter..
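With tags like these, the rollup method could be derived from the metric itself instead of from regex rules on its name. The mapping below is my assumption, chosen to mirror the min/max/sum/average rules in the config above; the function name is hypothetical.

```python
def rollup_method(tags):
    """Pick a consolidation method from the tags of a metric."""
    stat = tags.get("stat", "")
    if stat.startswith("upper") or stat == "max":
        return "max"
    if stat.startswith("lower") or stat == "min":
        return "min"
    if tags.get("target_type") == "count":
        return "sum"
    return "average"   # same default as the catch-all rule


methods = [
    rollup_method({"unit": "ms", "stat": "upper_90"}),
    rollup_method({"unit": "ms", "stat": "lower"}),
    rollup_method({"unit": "Req", "target_type": "count"}),
    rollup_method({"unit": "ms", "stat": "mean"}),
]
```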
● stats.timers.render_time.histogram.bin_0.01
● stats.timers.render_time.histogram.bin_0.1
● stats.timers.render_time.histogram.bin_1 → unit=Freq_abs bin_upper=1
● stats.timers.render_time.histogram.bin_10
● stats.timers.render_time.histogram.bin_50
● stats.timers.render_time.histogram.bin_inf
● stats.timers.render_time.lower → unit=ms stat=lower
● stats.timers.render_time.mean → unit=ms stat=mean
● stats.timers.render_time.mean_90 → ...
● stats.timers.render_time.median
● stats.timers.render_time.std
● stats.timers.render_time.upper
● stats.timers.render_time.upper_90
Also..
● graphite API functions such as "cumulative", "summarize" and "smartSummarize"
● Graph renderers
From: dygraphs.com
Facet based suggestions
Metric types
● gauge
● count & rate
● counter
● timer
gauge
● Multiple values in same interval
● "sticky"
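The two gauge questions can be made concrete with a small sketch. It shows one possible set of answers, not the behavior of any particular aggregator: the last value in an interval wins, and a sticky gauge keeps reporting its last value when nothing new arrives. The class name is hypothetical.

```python
class Gauge:
    def __init__(self, sticky=True):
        self.sticky = sticky
        self.value = None

    def update(self, value):
        # Multiple values in the same interval: the last write wins.
        self.value = value

    def flush(self):
        """Called once per interval; returns the value to report (or None)."""
        value = self.value
        if not self.sticky:
            self.value = None   # forget the value until a new one arrives
        return value


sticky = Gauge(sticky=True)
sticky.update(82)
sticky.update(123)
first, second = sticky.flush(), sticky.flush()    # second interval had no updates

plain = Gauge(sticky=False)
plain.update(82)
plain_first, plain_second = plain.flush(), plain.flush()
```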
Count & Rate
Counter
Timer..
http://janabeck.com/blog/2012/10/12/lessonslearnedfrom100/
Timer..
● What should a metric be?
● Stickiness?
● Behavior on no packets received
● Behavior on multiple packets received
My personal takeaways
Conclusion
● Building graphs and setting up alerting is cumbersome
● Esp. with changing information needs (troubleshooting, exploring, ..)
● Esp. with complicated information needs
→ PAIN
● Structuring metrics
● Self-describing metrics
● Standardized metrics
● Native metrics 2.0
→ BREEZE
Conclusion
● Metrics can be so much more usable and useful. Let's talk about tagging, standardisation, retaining information throughout the pipeline.
● Converting information needs into graph defs, alerting rules
● GraphExplorer, carbontagger, statsdaemon, …
● Graphiteng (native metrics 2.0)
● Metrics 2.0 in your apps, agents, aggregators?
● Build out structured metrics library
github.com/vimeo
github.com/Dieterbe
twitter.com/Dieter_be
dieter.plaetinck.be