Metrics stack 2.0


description

Most metrics systems link timeseries to a string key; some add a few tags. The identifiers often lack information, use inconsistent formats and terminology, and are poorly organized. As the number of people and programs generating, processing, storing and visualizing metrics grows, this approach becomes very cumbersome, and there is a lot to be gained from taking a step back and rethinking metric identifiers and metadata.

Metrics 2.0 is a set of conventions around metrics. With barely any extra work, metrics become self-describing and standardized. Compatibility between tools increases dramatically, dashboards can automatically convert information needs into graphs, graph renderers can present data more usefully, and anomaly detection and aggregators can work more autonomously and avoid common mistakes. The result: less micromanaging of software and configuration, quicker results, more clarity, less frustration and less room for error.

This talk will also cover the tools that turn this concept into production-ready reality. Graph-Explorer is an application that integrates with Graphite: enter an expression that represents an information need and it generates the corresponding graphs or alerting rules, automatically applying unit conversion, aggregation, processing, etc. Statsdaemon is an aggregation daemon like Etsy's StatsD that expresses the aggregations and statistical operations it performs by updating the metric tags, making sure that the metric metadata always corresponds to the data.

Dieter Plaetinck is a systems-gone-backend engineer at Vimeo.

Transcript of Metrics stack 2.0

Page 1: Metrics stack 2.0

   

Page 2: Metrics stack 2.0

   Credit: user niteroi @ panoramio.com

Page 3: Metrics stack 2.0

   

vimeo.com/43800150

Page 11: Metrics stack 2.0

   

1  Metrics 2.0 concepts

2  Implementation

3  Advanced stuff

Page 12: Metrics stack 2.0

   

“Dieter” ?

Page 13: Metrics stack 2.0

   

Peter → Deter

Page 14: Metrics stack 2.0

   

Terminology sync

Page 15: Metrics stack 2.0

   

(1234567890, 82)

(1234567900, 123)

(1234567910, 109)

(1234567920, 77)

db15.mysql.queries_running

host=db15 mysql.queries_running
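
To make the two identifier styles on this slide concrete, here is a minimal Python sketch. The `Series` type and its field names are illustrative assumptions, not taken from any tool mentioned in the talk. It stores the same points once under a single dotted key and once under a key plus a tag.

```python
from dataclasses import dataclass, field

@dataclass
class Series:
    """A timeseries: an identifier plus (timestamp, value) points.
    Hypothetical type, for illustration only."""
    key: str
    tags: dict = field(default_factory=dict)
    points: list = field(default_factory=list)

points = [(1234567890, 82), (1234567900, 123), (1234567910, 109), (1234567920, 77)]

# Style 1: everything is encoded in one dotted string key.
plain = Series(key="db15.mysql.queries_running", points=points)

# Style 2: the host moves out of the key and into a tag.
tagged = Series(key="mysql.queries_running", tags={"host": "db15"}, points=points)
```

In the second form a tool can filter or group on `host` without knowing which position of the dotted string holds the hostname.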

Page 16: Metrics stack 2.0

   

Page 17: Metrics stack 2.0

   

How many page requests/s is vimeo.com doing?

Page 18: Metrics stack 2.0

   

● stats.hits.vimeo_com

● stats_counts.hits.vimeo_com

Page 19: Metrics stack 2.0

   

Page 20: Metrics stack 2.0

   

stats.<host>.requesthostport.vimeo_com_443

Page 21: Metrics stack 2.0

   

stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90

Page 22: Metrics stack 2.0

   

O(X*Y*Z)

X = # apps

Y = # people

Z = # aggregators

Page 23: Metrics stack 2.0

   

How long does it take to retrieve an object from swift?

Page 24: Metrics stack 2.0

   

stats.timers.<host>.proxy-server.<swift_type>.<http_method>.<http_code>.timing.<stat>

stats.timers.<host>.object-server.<http_method>.timing.<stat>

target=stats.timers.dfs*.object*GET*timing.mean ?

target=groupByNode(stats.timers.dfs*.proxy-server.object.GET.*.timing.mean,2,"avg")

target=stats.timers.dfs*.object-server.GET.timing.mean

Page 25: Metrics stack 2.0

   

swift_type=object stat=mean timing GET avg by http_code

Page 26: Metrics stack 2.0

   

Page 27: Metrics stack 2.0

   

Page 28: Metrics stack 2.0

   

O((DxV)^2)

D = # dimensions

V = # values per dim

Page 29: Metrics stack 2.0

   

collectd.db.disk.sda1.disk_time.write

Page 30: Metrics stack 2.0

   

Page 31: Metrics stack 2.0

   

Page 32: Metrics stack 2.0

   

What should I name my metric?

Page 33: Metrics stack 2.0

   

101001000

100001000001000000

Page 34: Metrics stack 2.0

   

Page 35: Metrics stack 2.0

   

Metrics 2.0

Page 36: Metrics stack 2.0

   

Old:

● information lacking

● fields unclear & inconsistent

● cumbersome strings / trees

● forbidden characters

New:

● Self-describing

● Standardized

● all dimensions in orthogonal tag-space

● Allow some useful characters

Page 37: Metrics stack 2.0

   

stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90

{
    "server": "dfvimeodfsproxy5",
    "http_method": "GET",
    "http_code": "200",
    "unit": "ms",
    "target_type": "gauge",
    "stat": "upper_90",
    "swift_type": "object",
    "plugin": "swift_proxy_server"
}
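
One way to get from the dotted string to the tag set is a pattern per metric family. The sketch below is an assumption about how such an upgrade could look, not the code any of the tools actually use; `PATTERN` and `upgrade` are hypothetical names. Note that it yields the short host name from the key (`dfs5`), whereas the slide shows a full server name.

```python
import re

# Hypothetical pattern for the swift proxy-server timer keys shown above.
PATTERN = re.compile(
    r"^stats\.timers\.(?P<server>[^.]+)\.proxy-server\."
    r"(?P<swift_type>[^.]+)\.(?P<http_method>[^.]+)\.(?P<http_code>[^.]+)"
    r"\.timing\.(?P<stat>[^.]+)$"
)

def upgrade(key):
    """Return a tag dict for a matching key, or None if the key does not match."""
    match = PATTERN.match(key)
    if match is None:
        return None
    tags = match.groupdict()
    # Information the string never carried; it has to come from
    # knowledge about the source of the metric (assumed values here).
    tags.update({"unit": "ms", "target_type": "gauge", "plugin": "swift_proxy_server"})
    return tags

tags = upgrade("stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90")
```

The unit and target type are exactly the fields the old string left out, which is why they must be added by hand in a scheme like this.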

Page 38: Metrics stack 2.0

   

Main advantages:

● Immediate understanding of metric meaning (ideally)

● Minimize time to graphs, dashboards, alerting rules

Page 39: Metrics stack 2.0

   

github.com/vimeo/graph-explorer/wiki

Page 40: Metrics stack 2.0

   

SI + IEC

B   Err   Warn   Conn   Job   File   Req    ...

MB/s   Err/d   Req/h   ...

Page 41: Metrics stack 2.0

   

{
    "site": "vimeo.com",
    "port": 80,
    "unit": "Req/s",
    "direction": "in",
    "service": "webapp_php",
    "server": "webxx"
}

Page 42: Metrics stack 2.0

   

Page 43: Metrics stack 2.0

   

Carbon-tagger:

... service=foo.instance=host.target_type=gauge.type=calculation.unit=B 123 1234567890

Statsdaemon:

..unit=B ..unit=B ... → unit=B/s

..unit=ms ..unit=ms .. → unit=ms stat=mean

                       → unit=ms stat=upper_90

                       → ...
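
The idea on this slide, that an aggregator should update the tags whenever it changes what the numbers mean, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the actual carbon-tagger or statsdaemon code: `parse_line`, `as_rate` and `as_stat` are hypothetical helpers, and the parser assumes tag values contain no dots.

```python
def parse_line(line):
    """Split 'k=v.k=v value timestamp' into (tags, value, timestamp).
    Assumes tag values contain no '.' characters."""
    key, value, timestamp = line.split()
    tags = dict(pair.split("=", 1) for pair in key.split("."))
    return tags, float(value), int(timestamp)

def as_rate(tags):
    """Tags for a per-second rate derived from the input metric."""
    out = dict(tags)
    out["unit"] = tags["unit"] + "/s"
    out["target_type"] = "rate"
    return out

def as_stat(tags, stat):
    """Tags for a statistical summary (mean, upper_90, ...) of the input."""
    out = dict(tags)
    out["stat"] = stat
    return out

tags, value, timestamp = parse_line(
    "service=foo.instance=host.target_type=gauge.type=calculation.unit=B 123 1234567890"
)
rate_tags = as_rate({"service": "foo", "unit": "B", "target_type": "count"})
```

Both helpers return copies, so the tags of the incoming metric stay untouched and the derived metric carries its own, accurate description.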

Page 44: Metrics stack 2.0

   

Page 45: Metrics stack 2.0

   

Page 46: Metrics stack 2.0

   

Graph-Explorer queries 101

site:api.vimeo.com unit=Req/s

requesthostport api_vimeo_com
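
A simplified reading of the query patterns on these slides: `key=value` is an exact tag match, `key:pattern` matches within a tag's value, a bare word matches anywhere, and a leading `!` negates. The Python sketch below implements only that reading; it is an assumption for illustration, not Graph-Explorer's actual parser, and it ignores keywords such as `avg by` or `group by`.

```python
import re

def matches(tags, query):
    """Return True if a metric's tags satisfy every term of the query."""
    for term in query.split():
        negate = term.startswith("!")
        if negate:
            term = term[1:]
        if "=" in term:
            key, value = term.split("=", 1)
            hit = tags.get(key) == value
        elif ":" in term:
            key, pattern = term.split(":", 1)
            hit = key in tags and re.search(pattern, tags[key]) is not None
        else:
            hit = any(re.search(term, f"{k}={v}") for k, v in tags.items())
        if hit == negate:
            return False
    return True

# Hypothetical metric; tag values are kept as strings for simplicity.
metric = {"site": "api.vimeo.com", "unit": "Req/s", "port": "80", "server": "web1"}
found = matches(metric, "site:api.vimeo.com unit=Req/s")
```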

Page 47: Metrics stack 2.0

   

Page 48: Metrics stack 2.0

   

Smoothing

avg over 10M

avg over ...
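
As a rough illustration of what "avg over" does to a series, here is a trailing moving average in Python. It is a sketch, not the function Graph-Explorer or Graphite applies; the window is given in points, so "10M" at an assumed 10-second resolution would be a window of 60.

```python
def moving_average(values, window):
    """Trailing moving average; early points average over what is available."""
    out = []
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= window:
            total -= values[i - window]  # drop the point that left the window
        out.append(total / min(i + 1, window))
    return out

smoothed = moving_average([1, 2, 3, 4], 2)
```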

Page 49: Metrics stack 2.0

   

Page 50: Metrics stack 2.0

   

Aggregation, compare port 80 vs 443

avg by <dimension>

sum by <dimension>

sum by server
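
With all dimensions in tag space, "sum by server" reduces to: merge every series that differs only in its `server` tag. A minimal Python sketch of that idea follows; `aggregate_by` is a hypothetical helper and the series are assumed to be aligned lists of equal length.

```python
from collections import defaultdict

def aggregate_by(series, dimension, how="sum"):
    """series: list of (tags, values). Series that differ only in
    `dimension` are merged point by point."""
    groups = defaultdict(list)
    for tags, values in series:
        rest = tuple(sorted((k, v) for k, v in tags.items() if k != dimension))
        groups[rest].append(values)
    result = []
    for rest, members in groups.items():
        columns = list(zip(*members))
        if how == "sum":
            merged = [sum(col) for col in columns]
        else:  # "avg"
            merged = [sum(col) / len(col) for col in columns]
        result.append((dict(rest), merged))
    return result

series = [
    ({"port": "80", "server": "web1"}, [1, 2]),
    ({"port": "80", "server": "web2"}, [3, 4]),
    ({"port": "443", "server": "web1"}, [10, 20]),
]
by_port = {tags["port"]: values for tags, values in aggregate_by(series, "server")}
```

After summing by server, one series per port remains, which is the port 80 vs 443 comparison from the slide title.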

Page 51: Metrics stack 2.0

   

Page 52: Metrics stack 2.0

   

Compare port 80 traffic amongst servers

site:api.vimeo.com unit=Req/s port=80 group by none avg over 10M

Page 53: Metrics stack 2.0

   

Page 54: Metrics stack 2.0

   

Graph-Explorer queries 201

proxy-server swift server:regex upper_90 unit=ms from <datetime> to <datetime> avg over <timespec>

Page 59: Metrics stack 2.0

   

Compare object put/get

Stack .. http_method:(PUT|GET) swift_type=object avg by http_code,server

Page 60: Metrics stack 2.0

   

Page 61: Metrics stack 2.0

   

Comparing servers

http_method:(PUT|GET) avg by http_code,swift_type,http_method group by none

Page 62: Metrics stack 2.0

   

Page 63: Metrics stack 2.0

   

Compare http codes for GET, per swift type

http_method=GET avg by server group by swift_type

Page 64: Metrics stack 2.0

   

Page 65: Metrics stack 2.0

   

transcode unit=Job/s avg over <time> from <datetime> to <datetime>

Page 66: Metrics stack 2.0

    Note: data is obfuscated

Page 67: Metrics stack 2.0

   

Bucketing

!queue sum by zone:ap-southeast|eu-west|us-east|us-west|sa-east|vimeo-df|vimeo-lv group by state
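
Bucketing can be read as: map each `zone` value to the first listed bucket it contains, then sum within each bucket. The Python sketch below follows that reading; it is an assumption about the semantics, and the zone values and states in the example are made up.

```python
from collections import defaultdict

BUCKETS = ["ap-southeast", "eu-west", "us-east", "us-west", "sa-east", "vimeo-df", "vimeo-lv"]

def bucket_for(zone, buckets=BUCKETS):
    """Return the first bucket name contained in `zone`, or None."""
    return next((b for b in buckets if b in zone), None)

def sum_by_bucket(samples):
    """samples: iterable of (tags, value). Sum values per (bucket, state)."""
    totals = defaultdict(float)
    for tags, value in samples:
        bucket = bucket_for(tags["zone"])
        if bucket is not None:
            totals[(bucket, tags["state"])] += value
    return dict(totals)

# Hypothetical samples: two availability zones fall into the same bucket.
totals = sum_by_bucket([
    ({"zone": "us-east-1a", "state": "running"}, 2),
    ({"zone": "us-east-1b", "state": "running"}, 3),
    ({"zone": "eu-west-1a", "state": "pending"}, 1),
])
```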

Page 68: Metrics stack 2.0

    Note: data is obfuscated

Page 69: Metrics stack 2.0

   

Compare job states per region (zones bucket)

group by zone

Page 70: Metrics stack 2.0

    Note: data is obfuscated

Page 71: Metrics stack 2.0

   

Unit conversion

unit=Mb/s network dfvimeorpc sum by server
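
Because every metric carries its unit, a request for `unit=Mb/s` can be served from data stored in B/s. The Python sketch below shows the arithmetic for bit and byte units with SI and IEC prefixes. It is a minimal illustration, not Graph-Explorer's converter; it handles only `b` and `B` base units and does not convert the time part.

```python
SI = {"": 1, "k": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}
IEC = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
BASE_BITS = {"b": 1, "B": 8}  # size of the base unit in bits

def split_unit(unit):
    """'MB/s' -> (number of bits per unit, 's'); 'GB' -> (bits per unit, '')."""
    numerator, _, per = unit.partition("/")
    prefix, base = numerator[:-1], numerator[-1]
    scale = {**SI, **IEC}[prefix] * BASE_BITS[base]
    return scale, per

def convert(value, src, dst):
    """Convert between bit/byte units that share the same time denominator."""
    src_scale, src_per = split_unit(src)
    dst_scale, dst_per = split_unit(dst)
    if src_per != dst_per:
        raise ValueError("time conversion is not handled in this sketch")
    return value * src_scale / dst_scale

megabits = convert(1_000_000, "B/s", "Mb/s")
```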

Page 72: Metrics stack 2.0

   

Page 73: Metrics stack 2.0

   

Page 74: Metrics stack 2.0

   

unit=MB

Page 75: Metrics stack 2.0

   

Page 76: Metrics stack 2.0

   

Page 77: Metrics stack 2.0

   

{

    server=dfvimeodfs1

    plugin=diskspace

    mountpoint=_srv_node_dfs5

    unit=B

    type=used

    target_type=gauge

}

Page 78: Metrics stack 2.0

   

server:dfvimeodfs unit=GB type=free srv node

Page 79: Metrics stack 2.0

   

Page 80: Metrics stack 2.0

   

unit=GB/d group by mountpoint

Page 87: Metrics stack 2.0

   

Dashboard definition

 queries = [

   'cpu usage sum by core',

   'mem unit=B !total group by type:swap',

   'stack network unit=b/s',

   'unit=B (free|used) group by =mountpoint'

 ]

Page 88: Metrics stack 2.0

   

Page 89: Metrics stack 2.0

   

stats.dfvimeocliapp2.twitter.error

{
    "n1": "dfvimeocliapp2",
    "n2": "twitter",
    "n3": "error",
    "plugin": "catchall_statsd",
    "source": "statsd",
    "target_type": "rate",
    "unit": "unknown/s"
}
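
For keys that no specific pattern recognizes, the fallback shown here numbers the nodes `n1`, `n2`, ... and marks the unit as unknown. A minimal Python sketch of such a catch-all, assuming a `stats.` prefix and using the hypothetical helper name `catchall`:

```python
def catchall(key, prefix="stats."):
    """Turn an unrecognized dotted key into positional tags n1..nN."""
    if key.startswith(prefix):
        key = key[len(prefix):]
    tags = {f"n{i}": node for i, node in enumerate(key.split("."), start=1)}
    # Fixed tags as on the slide; the unit is unknown by definition.
    tags.update({
        "plugin": "catchall_statsd",
        "source": "statsd",
        "target_type": "rate",
        "unit": "unknown/s",
    })
    return tags

tags = catchall("stats.dfvimeocliapp2.twitter.error")
```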

Page 90: Metrics stack 2.0

   

Two hard things in computer science

Page 91: Metrics stack 2.0

   

stats.gauges.files.

id_boundary_7day

stats.gauges.files.

id_boundary_ceil

Page 92: Metrics stack 2.0

   

unit=File id_boundary_7d 

{
    "unit": "File",
    "n1": "id_boundary_7d"
}

Page 93: Metrics stack 2.0

   

{
    "intrinsic": {
        "site": "vimeo.com",
        "unit": "Req/s"
    },
    "extrinsic": {
        "agent": "diamond",
        "processed_by": "statsd1",
        "src": "index.php:135",
        "replaces": "vimeo_com_reqps"
    }
}
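
The split suggests that only intrinsic tags define which metric this is, while extrinsic tags are annotations that may change freely. A small Python sketch of that reading, with `metric_id` as a hypothetical helper (the exact identity format is an assumption):

```python
def metric_id(metric):
    """Build an identity string from the intrinsic tags only."""
    return ",".join(f"{k}={v}" for k, v in sorted(metric["intrinsic"].items()))

a = {
    "intrinsic": {"site": "vimeo.com", "unit": "Req/s"},
    "extrinsic": {"agent": "diamond", "processed_by": "statsd1"},
}
# Same metric, collected by a different (made-up) agent.
b = {
    "intrinsic": {"site": "vimeo.com", "unit": "Req/s"},
    "extrinsic": {"agent": "some_other_agent"},
}
same = metric_id(a) == metric_id(b)
```

Changing an extrinsic tag leaves the identity alone, so the series continues; changing an intrinsic tag such as `unit` yields a different metric.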

Page 94: Metrics stack 2.0

   

site=vimeo.com unit=Req/s \

  processed_by=statsd1 \

  src=index.php:135 added_by=dieter \

123 1234567890

Page 95: Metrics stack 2.0

   

Page 96: Metrics stack 2.0

   

Equivalence

servers.host.cpu.total.iowait → "core": "_sum_"

servers.host.cpu.<core-number>.iowait

servers.host.loadavg.15

Page 97: Metrics stack 2.0

   

Rollups & aggregation

Page 98: Metrics stack 2.0

   

/etc/carbon/storage-aggregation.conf

[min]
pattern = \.min$
aggregationMethod = min

[max]
pattern = \.max$
aggregationMethod = max

[sum]
pattern = \.count$
aggregationMethod = sum

[default_average]
pattern = .*
aggregationMethod = average

Page 99: Metrics stack 2.0

   

Page 100: Metrics stack 2.0

   

2 kinds of graphite users

Page 101: Metrics stack 2.0

   

Self-describing metrics

stat=upper/lower/mean/...

target_type=counter..
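
With `stat` and `target_type` tags available, the rollup method no longer has to be guessed from a name suffix. The Python sketch below shows one possible mapping onto Graphite's aggregation methods (`average`, `sum`, `min`, `max`, `last`). The mapping itself is an assumption for illustration, in particular the choice of `last` for counters.

```python
def rollup_method(tags):
    """Pick a storage aggregation method from a metric's tags."""
    stat = tags.get("stat")
    if stat in ("max", "upper"):
        return "max"
    if stat in ("min", "lower"):
        return "min"
    target_type = tags.get("target_type")
    if target_type == "count":
        return "sum"    # counts per interval add up over longer intervals
    if target_type == "counter":
        return "last"   # assumption: keep the latest value of an ever-increasing counter
    return "average"

method = rollup_method({"unit": "ms", "stat": "upper"})
```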

Page 102: Metrics stack 2.0

   

● stats.timers.render_time.histogram.bin_0.01

● stats.timers.render_time.histogram.bin_0.1

● stats.timers.render_time.histogram.bin_1 → unit=Freq_abs bin_upper=1

● stats.timers.render_time.histogram.bin_10

● stats.timers.render_time.histogram.bin_50

● stats.timers.render_time.histogram.bin_inf

● stats.timers.render_time.lower → unit=ms stat=lower

● stats.timers.render_time.mean → unit=ms stat=mean

● stats.timers.render_time.mean_90 → ...

● stats.timers.render_time.median

● stats.timers.render_time.std

● stats.timers.render_time.upper

● stats.timers.render_time.upper_90

Page 103: Metrics stack 2.0

   

Also..

● graphite API functions such as "cumulative", "summarize" and "smartSummarize"

● Graph renderers

Page 104: Metrics stack 2.0

   

Page 105: Metrics stack 2.0

   From: dygraphs.com

Page 111: Metrics stack 2.0

   

Facet based suggestions

Page 112: Metrics stack 2.0

   

Page 113: Metrics stack 2.0

   

Metric types

● gauge

● count & rate

● counter

● timer

Page 118: Metrics stack 2.0

   

gauge

● Multiple values in same interval

● "sticky"

Page 119: Metrics stack 2.0

   

Page 120: Metrics stack 2.0

   

Count & Rate

Page 121: Metrics stack 2.0

   

Counter

Page 122: Metrics stack 2.0

   

Timer..

Page 123: Metrics stack 2.0

   

Page 124: Metrics stack 2.0

   http://janabeck.com/blog/2012/10/12/lessons-learned-from-100/

Page 125: Metrics stack 2.0

   

Timer..

Page 126: Metrics stack 2.0

   

● What should a metric be?

● Stickiness?

● Behavior on no packets received

● Behavior on multiple packets received

Page 127: Metrics stack 2.0

   

My personal takeaways

Page 128: Metrics stack 2.0

   

Conclusion

● Building graphs and setting up alerting is cumbersome

● Esp. with changing information needs (troubleshooting, exploring, ..)

● Esp. with complicated information needs

  → PAIN

● Structuring metrics

● Self-describing metrics

● Standardized metrics

● Native metrics 2.0

  → BREEZE

Page 129: Metrics stack 2.0

   

Conclusion

● Metrics can be so much more usable and useful. Let's talk about tagging, standardisation, retaining information throughout the pipeline.

● Converting information needs into graph defs, alerting rules

● Graph-Explorer, carbon-tagger, statsdaemon, …

● Graphite-ng (native metrics 2.0)

● Metrics 2.0 in your apps, agents, aggregators?

● Build out structured metrics library

Page 130: Metrics stack 2.0

   

github.com/vimeo

github.com/Dieterbe

twitter.com/Dieter_be

dieter.plaetinck.be

Page 131: Metrics stack 2.0