Metrics stack 2.0

Transcript
Page 1: Metrics stack 2.0

   

Page 2: Metrics stack 2.0

   Credit: user niteroi @ panoramio.com

Page 3: Metrics stack 2.0

   

vimeo.com/43800150

Page 11: Metrics stack 2.0

   

1  Metrics 2.0 concepts

2  Implementation

3  Advanced stuff

Page 12: Metrics stack 2.0

   

“Dieter” ?

Page 13: Metrics stack 2.0

   

Peter → Deter

Page 14: Metrics stack 2.0

   

Terminology sync

Page 15: Metrics stack 2.0

   

(1234567890, 82)

(1234567900, 123)

(1234567910, 109)

(1234567920, 77)

db15.mysql.queries_running

host=db15 mysql.queries_running

Page 17: Metrics stack 2.0

   

How many page requests/s is vimeo.com doing?

Page 18: Metrics stack 2.0

   

● stats.hits.vimeo_com

● stats_counts.hits.vimeo_com

Page 20: Metrics stack 2.0

   

stats.<host>.requesthostport.vimeo_com_443

Page 21: Metrics stack 2.0

   

stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90

Page 22: Metrics stack 2.0

   

O(X*Y*Z)

X = # apps

Y = # people

Z = # aggregators

Page 23: Metrics stack 2.0

   

How long does it take to retrieve an object from swift?

Page 24: Metrics stack 2.0

   

stats.timers.<host>.proxy-server.<swift_type>.<http_method>.<http_code>.timing.<stat>

stats.timers.<host>.object-server.<http_method>.timing.<stat>

target=stats.timers.dfs*.object*GET*timing.mean ?

target=groupByNode(stats.timers.dfs*.proxy-server.object.GET.*.timing.mean,2,"avg")

target=stats.timers.dfs*.object-server.GET.timing.mean

Page 25: Metrics stack 2.0

   

swift_type=object stat=mean timing GET avg by http_code

Page 28: Metrics stack 2.0

   

O((DxV)^2)

D = # dimensions

V = # values per dim

Page 29: Metrics stack 2.0

   

collectd.db.disk.sda1.disk_time.write

Page 32: Metrics stack 2.0

   

What should I name my metric?

Page 33: Metrics stack 2.0

   

10   100   1000

10000   100000   1000000

Page 35: Metrics stack 2.0

   

Metrics 2.0

Page 36: Metrics stack 2.0

   

Old:

● information lacking

● fields unclear & inconsistent

● cumbersome strings / trees

● forbidden characters

New:

● Self-describing

● Standardized

● all dimensions in orthogonal tag-space

● Allow some useful characters

Page 37: Metrics stack 2.0

   

stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90

{
    "server": "dfvimeodfsproxy5",
    "http_method": "GET",
    "http_code": "200",
    "unit": "ms",
    "target_type": "gauge",
    "stat": "upper_90",
    "swift_type": "object",
    "plugin": "swift_proxy_server"
}
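
As a rough sketch of the mapping on this slide, the function below splits the Graphite string into named tags. It assumes the proxy-server layout shown earlier (stats.timers.<host>.proxy-server.<swift_type>.<http_method>.<http_code>.timing.<stat>). The function name and the fixed unit, target_type and plugin values are illustrative assumptions, not Graph-Explorer's actual plugin code, and the host field is kept as-is rather than expanded to the full server name.

```python
def swift_proxy_tags(name):
    """Turn a statsd timer name for the swift proxy-server into a tag dict."""
    parts = name.split(".")
    # Assumed layout (9 dot-separated nodes):
    # stats.timers.<host>.proxy-server.<swift_type>.<http_method>.<http_code>.timing.<stat>
    if (len(parts) != 9 or parts[:2] != ["stats", "timers"]
            or parts[3] != "proxy-server" or parts[7] != "timing"):
        raise ValueError("unexpected metric name: " + name)
    host, swift_type, http_method, http_code, stat = (
        parts[2], parts[4], parts[5], parts[6], parts[8])
    return {
        "server": host,                  # short host node, not the full hostname
        "http_method": http_method,
        "http_code": http_code,
        "unit": "ms",                    # assumption: statsd timers are in ms
        "target_type": "gauge",
        "stat": stat,
        "swift_type": swift_type,
        "plugin": "swift_proxy_server",  # hypothetical plugin label
    }


tags = swift_proxy_tags(
    "stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90")
print(tags["http_method"], tags["http_code"], tags["stat"])
```

Once the fields have names, a query can filter on http_code or stat directly instead of counting dots in a glob.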

Page 38: Metrics stack 2.0

   

Main advantages:

● Immediate understanding of metric meaning (ideally)

● Minimize time to graphs, dashboards, alerting rules

Page 39: Metrics stack 2.0

   

github.com/vimeo/graph-explorer/wiki

Page 40: Metrics stack 2.0

   

SI + IEC

B   Err   Warn   Conn   Job   File   Req    ...

MB/s   Err/d   Req/h   ...

Page 41: Metrics stack 2.0

   

{
    "site": "vimeo.com",
    "port": 80,
    "unit": "Req/s",
    "direction": "in",
    "service": "webapp_php",
    "server": "webxx"
}

Page 43: Metrics stack 2.0

   

Carbon-tagger:

... service=foo.instance=host.target_type=gauge.type=calculation.unit=B 123 1234567890

Statsdaemon:

..unit=B..unit=B... → unit=B/s

..unit=ms..unit=ms.. → unit=ms stat=mean

                     → unit=ms stat=upper_90

                     → ...
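
The two ideas on this slide can be sketched in a few lines: parsing a key=value metric line of the kind shown for carbon-tagger, and deriving output tags the way the Statsdaemon arrows suggest. This is only an illustration under stated assumptions: tag values are assumed to contain no dots or spaces, the function names are made up, and setting target_type to "rate" for the derived series is a guess, not something the slide states.

```python
def parse_tagged_line(line):
    """Parse '<k=v>.<k=v>... <value> <timestamp>' into (tags, value, timestamp)."""
    metric, value, timestamp = line.split()
    # Assumption: nodes are separated by dots and values contain no dots.
    tags = dict(node.split("=", 1) for node in metric.split("."))
    return tags, float(value), int(timestamp)


def rate_tags(tags):
    """Tags of the per-second rate computed from counted values (unit=B -> unit=B/s)."""
    derived = dict(tags)
    derived["unit"] = tags["unit"] + "/s"
    derived["target_type"] = "rate"  # assumption, not shown on the slide
    return derived


def timer_tags(tags, stats=("mean", "upper_90")):
    """One tag set per statistic computed from timing values; the unit stays the same."""
    return [dict(tags, stat=stat) for stat in stats]


tags, value, ts = parse_tagged_line(
    "service=foo.instance=host.target_type=gauge.type=calculation.unit=B "
    "123 1234567890")
print(tags["unit"], value, ts)
print(rate_tags(tags)["unit"])
print(timer_tags({"unit": "ms"}))
```

The point of the sketch is that the aggregator updates the tags along with the values, so the information survives the pipeline.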

Page 46: Metrics stack 2.0

   

Graph-Explorer queries 101

site:api.vimeo.com unit=Req/s

requesthostport api_vimeo_com

Page 48: Metrics stack 2.0

   

Smoothing

avg over 10M

avg over ...
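
One plausible reading of "avg over 10M" is a trailing moving average over a 10-minute window. The sketch below shows that reading on (timestamp, value) points like the ones in the terminology slide; the function name and the exact window boundaries are assumptions, and Graph-Explorer may implement smoothing differently.

```python
def avg_over(points, window_seconds):
    """Trailing moving average over (timestamp, value) points sorted by time.

    Each output value averages the inputs with timestamps in
    (ts - window_seconds, ts].
    """
    smoothed = []
    start = 0    # index of the oldest point still inside the window
    total = 0.0  # running sum of the values inside the window
    for i, (ts, value) in enumerate(points):
        total += value
        while points[start][0] <= ts - window_seconds:
            total -= points[start][1]
            start += 1
        smoothed.append((ts, total / (i - start + 1)))
    return smoothed


points = [(1234567890, 82), (1234567900, 123),
          (1234567910, 109), (1234567920, 77)]
print(avg_over(points, 20))   # 20 s window: two points per average here
print(avg_over(points, 600))  # "10M" read as 600 seconds
```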

Page 50: Metrics stack 2.0

   

Aggregation, compare port 80 vs 443

avg by <dimension>

sum by <dimension>

sum by server
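
A minimal sketch of what "sum by <dimension>" and "avg by <dimension>" could mean on tagged series: series that differ only in that dimension are combined point by point. The data layout (a list of (tags, values) pairs), the function name and the server names are assumptions for illustration, not Graph-Explorer's internals.

```python
from collections import defaultdict


def aggregate_by(series, dimension, how="sum"):
    """Combine series that are identical except for `dimension`."""
    groups = defaultdict(list)
    for tags, values in series:
        # The remaining tags identify the group the series falls into.
        key = tuple(sorted((k, v) for k, v in tags.items() if k != dimension))
        groups[key].append(values)
    result = []
    for key, members in groups.items():
        combined = [sum(column) for column in zip(*members)]
        if how == "avg":
            combined = [v / len(members) for v in combined]
        result.append((dict(key), combined))
    return result


series = [
    ({"server": "web1", "port": "80"}, [10, 20]),
    ({"server": "web2", "port": "80"}, [30, 40]),
    ({"server": "web1", "port": "443"}, [1, 2]),
    ({"server": "web2", "port": "443"}, [3, 4]),
]
# "sum by server" leaves one series per port, which is the 80 vs 443 comparison.
for tags, values in aggregate_by(series, "server"):
    print(tags, values)
```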

Page 52: Metrics stack 2.0

   

Compare port 80 traffic among servers

site:api.vimeo.com unit=Req/s port=80 group by none avg over 10M

Page 54: Metrics stack 2.0

   

Graph-Explorer queries 201

proxy-server swift server:regex upper_90 unit=ms from <datetime> to <datetime> avg over <timespec>

Page 59: Metrics stack 2.0

   

Compare object put/get

Stack .. http_method:(PUT|GET) swift_type=object avg by http_code,server

Page 61: Metrics stack 2.0

   

Comparing servers

http_method:(PUT|GET) avg by http_code,swift_type,http_method group by none

Page 63: Metrics stack 2.0

   

Compare http codes for GET, per swift type

http_method=GET avg by server group by swift_type

Page 65: Metrics stack 2.0

   

transcode unit=Job/s avg over <time> from <datetime> to <datetime>

Page 66: Metrics stack 2.0

    Note: data is obfuscated

Page 67: Metrics stack 2.0

   

Bucketing

!queue sum by zone:ap-southeast|eu-west|us-east|us-west|sa-east|vimeo-df|vimeo-lv group by state
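
Bucketing can be pictured as "sum by", except that the values of the dimension are first mapped onto a short list of buckets. The sketch below assumes a bucket matches when its name is a substring of the tag value and that unmatched values land in an "other" bucket; both rules, the function name and the zone names in the example are assumptions for illustration.

```python
from collections import defaultdict


def sum_by_buckets(series, dimension, buckets):
    """Sum one value per series into the first bucket matching its tag value."""
    totals = defaultdict(float)
    for tags, value in series:
        tag_value = tags.get(dimension, "")
        # Assumed matching rule: bucket name is a substring of the tag value.
        bucket = next((b for b in buckets if b in tag_value), "other")
        totals[bucket] += value
    return dict(totals)


series = [
    ({"zone": "us-east-1a", "state": "queued"}, 3),  # hypothetical zone names
    ({"zone": "us-east-1b", "state": "queued"}, 4),
    ({"zone": "eu-west-1a", "state": "queued"}, 5),
]
print(sum_by_buckets(series, "zone", ["eu-west", "us-east"]))
```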

Page 68: Metrics stack 2.0

    Note: data is obfuscated

Page 69: Metrics stack 2.0

   

Compare job states per region (zones bucket)

group by zone

Page 70: Metrics stack 2.0

    Note: data is obfuscated

Page 71: Metrics stack 2.0

   

Unit conversion

unit=Mb/s network dfvimeorpc sum by server
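
Because the unit is an explicit tag, asking for unit=Mb/s when the data is stored in another compatible unit comes down to arithmetic on the prefix. Below is a small sketch for byte and bit units with SI and IEC prefixes. The table, the function names and the assumption that the stored unit is B/s are illustrative and not taken from Graph-Explorer.

```python
SI = {"": 1, "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}
IEC = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
PREFIXES = {**SI, **IEC}
BITS_PER_SYMBOL = {"B": 8, "b": 1}  # byte and bit


def split_unit(unit):
    """Return (size in bits, time base) for units such as 'B', 'MB', 'Mb/s', 'GiB'."""
    base, _, per = unit.partition("/")
    prefix, symbol = base[:-1], base[-1:]
    if symbol not in BITS_PER_SYMBOL or prefix not in PREFIXES:
        raise ValueError("unsupported unit: " + unit)
    return PREFIXES[prefix] * BITS_PER_SYMBOL[symbol], per


def convert(value, src_unit, dst_unit):
    """Convert between byte/bit units that share the same time base."""
    src_bits, src_per = split_unit(src_unit)
    dst_bits, dst_per = split_unit(dst_unit)
    if src_per != dst_per:
        raise ValueError("time bases differ: %s vs %s" % (src_unit, dst_unit))
    return value * src_bits / dst_bits


print(convert(1000000, "B/s", "Mb/s"))  # 8.0
print(convert(1, "GiB", "B"))           # 1073741824.0
```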

Page 74: Metrics stack 2.0

   

unit=MB

Page 77: Metrics stack 2.0

   

{

    server=dfvimeodfs1

    plugin=diskspace

    mountpoint=_srv_node_dfs5

    unit=B

    type=used

    target_type=gauge

}

Page 78: Metrics stack 2.0

   

server:dfvimeodfs unit=GB type=free srv node

Page 80: Metrics stack 2.0

   

unit=GB/d group by mountpoint

Page 87: Metrics stack 2.0

   

Dashboard definition

queries = [
    'cpu usage sum by core',
    'mem unit=B !total group by type:swap',
    'stack network unit=b/s',
    'unit=B (free|used) group by =mountpoint'
]

Page 89: Metrics stack 2.0

   

stats.dfvimeocliapp2.twitter.error

{
    "n1": "dfvimeocliapp2",
    "n2": "twitter",
    "n3": "error",
    "plugin": "catchall_statsd",
    "source": "statsd",
    "target_type": "rate",
    "unit": "unknown/s"
}
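
The catch-all conversion on this slide is simple enough to sketch: every node after the "stats" prefix becomes a positional tag n1, n2, ..., and the remaining tags are fixed. The function below is a guess at that behavior reconstructed from the example alone, not the actual plugin code.

```python
def catchall_statsd(name):
    """Tag an unrecognized statsd metric with positional n1..nN tags."""
    parts = name.split(".")
    if len(parts) < 2 or parts[0] != "stats":
        raise ValueError("not a statsd metric: " + name)
    tags = {"n%d" % i: node for i, node in enumerate(parts[1:], start=1)}
    tags.update({
        "plugin": "catchall_statsd",
        "source": "statsd",
        "target_type": "rate",
        "unit": "unknown/s",  # the meaning of the nodes is unknown, so is the unit
    })
    return tags


print(catchall_statsd("stats.dfvimeocliapp2.twitter.error"))
```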

Page 90: Metrics stack 2.0

   

Two hard things in computer science

Page 91: Metrics stack 2.0

   

stats.gauges.files.id_boundary_7day

stats.gauges.files.id_boundary_ceil

Page 92: Metrics stack 2.0

   

unit=File id_boundary_7d 

{
    "unit": "File",
    "n1": "id_boundary_7d"
}

Page 93: Metrics stack 2.0

   

{
    "intrinsic": {
        "site": "vimeo.com",
        "unit": "Req/s"
    },
    "extrinsic": {
        "agent": "diamond",
        "processed_by": "statsd1",
        "src": "index.php:135",
        "replaces": "vimeo_com_reqps"
    }
}
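
One way to read the intrinsic/extrinsic split is that only the intrinsic tags make up the identity of a metric, while extrinsic tags are metadata that can change without creating a new series. The sketch below derives an identifier on that assumption; the canonical form (sorted key=value pairs joined by commas) and the function name are made up for illustration.

```python
def metric_id(metric):
    """Build an identifier from the intrinsic tags only, in a fixed key order."""
    intrinsic = metric["intrinsic"]
    return ",".join("%s=%s" % (key, intrinsic[key]) for key in sorted(intrinsic))


before = {
    "intrinsic": {"site": "vimeo.com", "unit": "Req/s"},
    "extrinsic": {"agent": "diamond", "processed_by": "statsd1"},
}
# Same metric after the pipeline changed: only extrinsic tags differ.
after = {
    "intrinsic": {"unit": "Req/s", "site": "vimeo.com"},
    "extrinsic": {"agent": "diamond", "processed_by": "statsd2"},
}
print(metric_id(before))
print(metric_id(before) == metric_id(after))
```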

Page 94: Metrics stack 2.0

   

site=vimeo.com unit=Req/s \
  processed_by=statsd1 \
  src=index.php:135 added_by=dieter \
  123 1234567890

Page 96: Metrics stack 2.0

   

Equivalence

servers.host.cpu.total.iowait → "core": "_sum_"

servers.host.cpu.<core-number>.iowait

servers.host.loadavg.15

Page 97: Metrics stack 2.0

   

Rollups & aggregation

Page 98: Metrics stack 2.0

   

/etc/carbon/storage-aggregation.conf

[min]
pattern = \.min$
aggregationMethod = min

[max]
pattern = \.max$
aggregationMethod = max

[sum]
pattern = \.count$
aggregationMethod = sum

[default_average]
pattern = .*
aggregationMethod = average
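
To see what this configuration does to real metric names, the rules can be replayed in a few lines. The sketch assumes the sections are tried in file order and the first matching pattern wins; the rule list mirrors the file above and the function name is made up.

```python
import re

# (section, pattern, aggregationMethod), in the order they appear in the file.
RULES = [
    ("min", r"\.min$", "min"),
    ("max", r"\.max$", "max"),
    ("sum", r"\.count$", "sum"),
    ("default_average", r".*", "average"),
]


def aggregation_method(metric_name, rules=RULES):
    """Return the rollup method of the first rule whose pattern matches."""
    for _section, pattern, method in rules:
        if re.search(pattern, metric_name):
            return method
    return None


for name in ("stats.timers.render_time.upper",
             "stats.timers.render_time.lower",
             "stats.timers.render_time.count"):
    print(name, "->", aggregation_method(name))
```

With these rules, statsd names such as ...upper or ...lower fall through to the default average, because the patterns only recognize the suffixes .min, .max and .count. The rollup is only as good as the naming convention.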

Page 100: Metrics stack 2.0

   

2 kinds of graphite users

Page 101: Metrics stack 2.0

   

Self-describing metrics

stat=upper/lower/mean/...

target_type=counter..
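
With self-describing metrics the rollup method could be picked from the tags instead of from regular expressions on the name. The mapping below is an assumption about how that might look, not something the slides specify: lower/min roll up with min, upper/max with max, counts with sum, ever-increasing counters with last, and everything else with average.

```python
def aggregation_from_tags(tags):
    """Pick a rollup method from the stat and target_type tags (assumed mapping)."""
    stat = tags.get("stat", "")
    target_type = tags.get("target_type", "")
    if stat in ("lower", "min"):
        return "min"
    if stat in ("upper", "max"):
        return "max"
    if target_type == "count":
        return "sum"      # per-interval counts add up across a rollup window
    if target_type == "counter":
        return "last"     # an ever-increasing counter keeps its latest value
    return "average"


print(aggregation_from_tags({"unit": "ms", "stat": "upper"}))
print(aggregation_from_tags({"unit": "Req", "target_type": "count"}))
print(aggregation_from_tags({"unit": "ms", "stat": "mean"}))
```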

Page 102: Metrics stack 2.0

   

● stats.timers.render_time.histogram.bin_0.01

● stats.timers.render_time.histogram.bin_0.1

● stats.timers.render_time.histogram.bin_1 → unit=Freq_abs bin_upper=1

● stats.timers.render_time.histogram.bin_10

● stats.timers.render_time.histogram.bin_50

● stats.timers.render_time.histogram.bin_inf

● stats.timers.render_time.lower → unit=ms stat=lower

● stats.timers.render_time.mean → unit=ms stat=mean

● stats.timers.render_time.mean_90 → ...

● stats.timers.render_time.median

● stats.timers.render_time.std

● stats.timers.render_time.upper

● stats.timers.render_time.upper_90

Page 103: Metrics stack 2.0

   

Also..

● graphite API functions such as "cumulative", "summarize" and "smartSummarize"

● Graph renderers

Page 105: Metrics stack 2.0

   From: dygraphs.com

Page 111: Metrics stack 2.0

   

Facet based suggestions

Page 113: Metrics stack 2.0

   

Metric types

● gauge

● count & rate

● counter

● timer

Page 118: Metrics stack 2.0

   

gauge

● Multiple values in same interval

● "sticky"

Page 120: Metrics stack 2.0

   

Count & Rate

Page 121: Metrics stack 2.0

   

Counter

Page 122: Metrics stack 2.0

   

Timer..

Page 124: Metrics stack 2.0

   http://janabeck.com/blog/2012/10/12/lessons-learned-from-100/

Page 125: Metrics stack 2.0

   

Timer..

Page 126: Metrics stack 2.0

   

● What should a metric be?

● Stickiness?

● Behavior on no packets received

● Behavior on multiple packets received

Page 127: Metrics stack 2.0

   

My personal takeaways

Page 128: Metrics stack 2.0

   

Conclusion

● Building graphs and setting up alerting is cumbersome

● Esp. with changing information needs (troubleshooting, exploring, ..)

● Esp. with complicated information needs

→ PAIN

● Structuring metrics

● Self-describing metrics

● Standardized metrics

● Native metrics 2.0

→ BREEZE

Page 129: Metrics stack 2.0

   

Conclusion

● Metrics can be so much more usable and useful. Let's talk about tagging, standardisation, retaining information throughout the pipeline.

● Converting information needs into graph defs, alerting rules

● Graph-Explorer, carbon-tagger, statsdaemon, …

● Graphite-ng (native metrics 2.0)

● Metrics 2.0 in your apps, agents, aggregators?

● Build out structured metrics library

Page 130: Metrics stack 2.0

   

github.com/vimeo

github.com/Dieterbe

twitter.com/Dieter_be

dieter.plaetinck.be
