Rethinking metrics: metrics 2.0 @ Lisa 2014


description

As the number of metrics, the software that produces and processes them, and the people involved with them continue to increase, we need better ways to organize metrics and make them self-describing, and to do so consistently. Leveraging this, we can then automatically build graphs and dashboards, given a query that represents an information need, even for complicated cases. We can build richer visualizations, alerting and fault detection. This talk will introduce the concepts and related tools, demonstrate possibilities using the Graph-Explorer interface, and lay the groundwork for future work.

Transcript of Rethinking metrics: metrics 2.0 @ Lisa 2014

Page 1: Rethinking metrics: metrics 2.0 @ Lisa 2014

rethinking metrics:

metrics 2.0

Page 5: Rethinking metrics: metrics 2.0 @ Lisa 2014

instagram.com/wrongrob

Page 6: Rethinking metrics: metrics 2.0 @ Lisa 2014

vimeo.com/43800150

Page 10: Rethinking metrics: metrics 2.0 @ Lisa 2014

problems

Metrics 2.0 concepts

implementations, uses & ideas

Page 11: Rethinking metrics: metrics 2.0 @ Lisa 2014

Mostly

graphite

Page 12: Rethinking metrics: metrics 2.0 @ Lisa 2014

terminology

sync

Page 13: Rethinking metrics: metrics 2.0 @ Lisa 2014

(1234567890, 82)

(1234567900, 123)

(1234567910, 109)

(1234567920, 77)

db15.mysql.queries_running

host=db15 mysql.queries_running

Page 14: Rethinking metrics: metrics 2.0 @ Lisa 2014

Problems

Page 15: Rethinking metrics: metrics 2.0 @ Lisa 2014

Vimeo.com page requests/s?

server X disk write?

Page 16: Rethinking metrics: metrics 2.0 @ Lisa 2014

stats.hits.vimeo_com

stats_counts.hits.vimeo_com

stats.*.vimeo_requests

collectd.db.disk.sda1.disk_time.write

Page 17: Rethinking metrics: metrics 2.0 @ Lisa 2014

Terminology? Meaning?

Prefix?

Unit?

Aggregation?

Source?

Understanding metrics

Page 19: Rethinking metrics: metrics 2.0 @ Lisa 2014

Unclear, inconsistent terminology, format

tightly coupled

lack information

Page 20: Rethinking metrics: metrics 2.0 @ Lisa 2014

http://litlquest.com/forest-trees/see-forest-trees-2

Page 21: Rethinking metrics: metrics 2.0 @ Lisa 2014

O(S*P*A*C)

S = # Sources

P = # People

A = # Aggregators

C = # Complexity

Page 23: Rethinking metrics: metrics 2.0 @ Lisa 2014

Graphs and dashboards are a huge time sink.

Page 24: Rethinking metrics: metrics 2.0 @ Lisa 2014

metrics 2.0

concepts

Page 25: Rethinking metrics: metrics 2.0 @ Lisa 2014

Self-describing

Standardized

Orthogonal dimensions

Page 26: Rethinking metrics: metrics 2.0 @ Lisa 2014

stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90

Page 27: Rethinking metrics: metrics 2.0 @ Lisa 2014

{

server: dfvimeodfsproxy5,

http_method: GET,

http_code: 200,

unit: ms,

metric_type: gauge,

stat: upper_90,

swift_type: object

}
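
A minimal sketch of how the tag set above could be held in code. It assumes a plain Python dict; the `metric_id` helper and its sorted key=value scheme are hypothetical and not taken from the metrics 2.0 spec.

```python
# The tag set from the slide as a plain dict (values kept as strings).
metric = {
    "server": "dfvimeodfsproxy5",
    "http_method": "GET",
    "http_code": "200",
    "unit": "ms",
    "metric_type": "gauge",
    "stat": "upper_90",
    "swift_type": "object",
}


def metric_id(tags):
    """Build an order-independent identifier from key=value pairs (assumed scheme)."""
    return " ".join(f"{k}={v}" for k, v in sorted(tags.items()))


print(metric_id(metric))
```

Because every dimension is its own key, no position in a dotted name has to be memorized to find the unit or the HTTP method.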

Page 28: Rethinking metrics: metrics 2.0 @ Lisa 2014

SI + IEC

B Err Warn Conn File Req …

MB/s Err/d Req/h ...

Page 29: Rethinking metrics: metrics 2.0 @ Lisa 2014

allow more characters

unit: Req/s, site: vimeo.com, ...

Page 30: Rethinking metrics: metrics 2.0 @ Lisa 2014

Metadata

meta: {

src: proxy.py:458,

from: diamond

}

Page 31: Rethinking metrics: metrics 2.0 @ Lisa 2014

metrics20.org

Page 32: Rethinking metrics: metrics 2.0 @ Lisa 2014

Immediate understanding

of metrics

Minimize time to graphs,

alerting, troubleshooting

compatibility & flexibility

in tooling

Page 33: Rethinking metrics: metrics 2.0 @ Lisa 2014

Implementations: getting the data

Page 35: Rethinking metrics: metrics 2.0 @ Lisa 2014

Source formats

…service=foo instance=host unit=B 123 1234567890

{s}foo.{i}host.{u}B 123 1234567890

<uuid> 125 1234567890 # separate data
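
A rough sketch of parsing the first source format above (key=value tags, then value, then timestamp). The function name `parse_line` is hypothetical, and the sketch ignores whatever the leading "…" on the slide stands for.

```python
def parse_line(line):
    """Parse 'k=v k=v ... value timestamp' into (tags, value, timestamp)."""
    # The last two fields are the value and the Unix timestamp; the rest are tags.
    *pairs, value, ts = line.split()
    tags = dict(p.split("=", 1) for p in pairs)
    return tags, float(value), int(ts)


tags, value, ts = parse_line("service=foo instance=host unit=B 123 1234567890")
print(tags, value, ts)
```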

Page 36: Rethinking metrics: metrics 2.0 @ Lisa 2014

Carbon-tagger

…stats.gauges.host.foo 125 1234567890

service=foo instance=host target_type=gauge unit=B 123 1234567890

Page 38: Rethinking metrics: metrics 2.0 @ Lisa 2014

Statsdaemon

unit=B

unit=B

...

unit=ms

unit=ms

...

unit=B/s

unit=ms stat=mean

unit=ms stat=upper_90

...
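
The slide suggests that an aggregator updates the tags as it changes what the numbers mean: byte counts become a rate in B/s, and timings in ms gain a stat tag. A minimal sketch under that reading follows. The function names are hypothetical and the nearest-rank percentile rule is an assumption, not necessarily what statsdaemon does.

```python
import math


def flush_counter(tags, values, interval_s):
    """Sum counter increments over one flush interval and emit a per-second rate."""
    out = dict(tags, unit=tags["unit"] + "/s")  # B becomes B/s
    return out, sum(values) / interval_s


def flush_timer(tags, values, pct=90):
    """Emit mean and upper_<pct> for a batch of timings; the unit stays the same."""
    ordered = sorted(values)
    # Nearest-rank: the largest sample within the lowest pct percent of samples.
    k = max(1, math.ceil(len(ordered) * pct / 100))
    return [
        (dict(tags, stat="mean"), sum(ordered) / len(ordered)),
        (dict(tags, stat=f"upper_{pct}"), ordered[k - 1]),
    ]


print(flush_counter({"unit": "B"}, [100, 200, 300], 10))
print(flush_timer({"unit": "ms"}, list(range(1, 11))))
```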

Page 39: Rethinking metrics: metrics 2.0 @ Lisa 2014

Keep metric

tags in sync with data

Page 40: Rethinking metrics: metrics 2.0 @ Lisa 2014

Implementations

Graphing & dashboarding

Visualization

Alerting

Page 41: Rethinking metrics: metrics 2.0 @ Lisa 2014

Graphing & Dashboarding

Page 42: Rethinking metrics: metrics 2.0 @ Lisa 2014

Graph-Explorer

Page 44: Rethinking metrics: metrics 2.0 @ Lisa 2014

Graph-Explorer queries 101

proxy-server swift server:regex unit=ms

(AND)
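
A simplified sketch of how such a query could be matched against a metric's tags, with all terms ANDed. This is an illustration, not Graph-Explorer's actual matcher: it assumes `key=value` is an exact match, `key:regex` is a regex search on that tag's value, and a bare word matches if it occurs in any tag key or value. The example tags are hypothetical.

```python
import re


def matches(tags, query):
    """Return True only if every whitespace-separated term matches (AND)."""
    for term in query.split():
        if "=" in term:
            key, val = term.split("=", 1)
            ok = tags.get(key) == val
        elif ":" in term:
            key, pattern = term.split(":", 1)
            ok = key in tags and re.search(pattern, tags[key]) is not None
        else:
            ok = any(term in k or term in v for k, v in tags.items())
        if not ok:
            return False
    return True


# Hypothetical tag set for illustration.
tags = {
    "server": "dfvimeodfsproxy5",
    "service": "proxy-server",
    "swift_type": "object",
    "unit": "ms",
}
print(matches(tags, "proxy-server swift server:proxy unit=ms"))
```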

Page 52: Rethinking metrics: metrics 2.0 @ Lisa 2014

upper_90 (or stat=upper_90)

from <datetime> to <datetime>

avg over <timespec> (5M, 1h, 3d, ...)

Page 54: Rethinking metrics: metrics 2.0 @ Lisa 2014

Compare object put/get

stack …

http_method:(PUT|GET)

swift_type=object

avg by http_code,server

Page 56: Rethinking metrics: metrics 2.0 @ Lisa 2014

Comparing servers

http_method:(PUT|GET)

group by unit,target_type

avg by http_code,swift_type,http_method
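
One way to read `avg by <tags>` in the queries above: series that differ only in the listed tags are averaged into one. The sketch below assumes that reading and assumes all series have aligned, equal-length points. The function name and sample data are hypothetical.

```python
from collections import defaultdict


def avg_by(series, keys):
    """Average series that differ only in the given tag keys."""
    buckets = defaultdict(list)
    for tags, points in series:
        # Series sharing all remaining tags land in the same bucket.
        rest = tuple(sorted((k, v) for k, v in tags.items() if k not in keys))
        buckets[rest].append(points)
    out = []
    for rest, group in buckets.items():
        avg = [sum(col) / len(col) for col in zip(*group)]
        out.append((dict(rest), avg))
    return out


series = [
    ({"server": "a", "http_code": "200"}, [1, 3]),
    ({"server": "a", "http_code": "404"}, [3, 5]),
    ({"server": "b", "http_code": "200"}, [10, 10]),
]
print(avg_by(series, {"http_code"}))
```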

Page 58: Rethinking metrics: metrics 2.0 @ Lisa 2014

transcode unit=Job/s avg over <time>

from <datetime> to <datetime>

Page 59: Rethinking metrics: metrics 2.0 @ Lisa 2014

Note: data is obfuscated

Page 60: Rethinking metrics: metrics 2.0 @ Lisa 2014

Bucketing

sum by zone:eu-west|us-east|ap-southeast|us-west|sa-east|vimeo-df|vimeo-lv

group by state
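
A small sketch of the bucketing idea, assuming a bucket collects every series whose tag value contains the bucket text and that unmatched values keep their own name. The function name and sample zones are hypothetical.

```python
from collections import defaultdict


def sum_by_bucket(series, key, buckets):
    """Sum the values of series whose `key` tag falls into the same bucket."""
    totals = defaultdict(float)
    for tags, value in series:
        # First bucket whose text occurs in the tag value; otherwise the raw value.
        bucket = next((b for b in buckets if b in tags[key]), tags[key])
        totals[bucket] += value
    return dict(totals)


series = [
    ({"zone": "us-east-1a"}, 2),
    ({"zone": "us-east-1b"}, 3),
    ({"zone": "eu-west-1a"}, 5),
]
print(sum_by_bucket(series, "zone", ["eu-west", "us-east"]))
```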

Page 61: Rethinking metrics: metrics 2.0 @ Lisa 2014

Note: data is obfuscated

Page 62: Rethinking metrics: metrics 2.0 @ Lisa 2014

Compare job states per region

group by zone

Page 63: Rethinking metrics: metrics 2.0 @ Lisa 2014

Note: data is obfuscated

Page 64: Rethinking metrics: metrics 2.0 @ Lisa 2014

Unit conversion

unit=Mb/s network server:regex sum by server

Page 67: Rethinking metrics: metrics 2.0 @ Lisa 2014

Integration

Metric unit=B/s Query unit=TB

Page 69: Rethinking metrics: metrics 2.0 @ Lisa 2014

Deriving

Metric unit=B Query unit=GB/d
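
Because the unit is an explicit tag, a tool can work out how to get from the stored unit to the requested one: integrate a rate (B/s) to get a total (B, shown as TB), or derive a total (B) to get a rate (B/s, shown as GB/d). A minimal sketch of both conversions on (timestamp, value) points follows; the helper names are hypothetical and SI prefixes are assumed.

```python
def integrate(points):
    """Turn (timestamp, per-second rate) points into a running total."""
    total, out = 0.0, []
    for (t0, _), (t1, rate) in zip(points, points[1:]):
        total += rate * (t1 - t0)  # rate held over the elapsed seconds
        out.append((t1, total))
    return out


def derive(points):
    """Turn (timestamp, total) points into per-second rates between neighbours."""
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(points, points[1:])
    ]


TB = 10**12                              # bytes in an SI terabyte
B_PER_S_TO_GB_PER_D = 86400 / 10**9      # seconds per day / bytes per SI gigabyte

rates = [(0, 100), (10, 100), (20, 200)]  # B/s
totals = integrate(rates)                 # B
print([(t, v / TB) for t, v in totals])   # TB
print([(t, r * B_PER_S_TO_GB_PER_D) for t, r in derive([(0, 0)] + totals)])  # GB/d
```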

Page 71: Rethinking metrics: metrics 2.0 @ Lisa 2014

Highly extensible

Equal rights for all tags

→ real world use drives spec

Page 72: Rethinking metrics: metrics 2.0 @ Lisa 2014

SI + IEC

B Err Conn File Req

Anything

Err/s Anything/s MAnything/d

Page 76: Rethinking metrics: metrics 2.0 @ Lisa 2014

Minimize time from information need to insights.

Page 77: Rethinking metrics: metrics 2.0 @ Lisa 2014

Future Work

Page 78: Rethinking metrics: metrics 2.0 @ Lisa 2014

Facet-based suggestions

Custom hierarchies

Page 79: Rethinking metrics: metrics 2.0 @ Lisa 2014

Tag insights

Page 80: Rethinking metrics: metrics 2.0 @ Lisa 2014

● Storage aggregation rules

● graphite API functions such as cumulative, summarize and smartSummarize

● consolidateBy & Graph renderers

Page 83: Rethinking metrics: metrics 2.0 @ Lisa 2014

stat=upper/lower/mean/... (assume avg otherwise)

Page 84: Rethinking metrics: metrics 2.0 @ Lisa 2014

Visualizations

Page 85: Rethinking metrics: metrics 2.0 @ Lisa 2014

From: dygraphs.com

Page 86: Rethinking metrics: metrics 2.0 @ Lisa 2014

bin=10

bin=20

bin=30

bin=40

bin=50

bin=100

Page 88: Rethinking metrics: metrics 2.0 @ Lisa 2014

Alerting

Page 90: Rethinking metrics: metrics 2.0 @ Lisa 2014

unit=Err/s

Page 95: Rethinking metrics: metrics 2.0 @ Lisa 2014

Classifying clusters of

cause & effect

Page 96: Rethinking metrics: metrics 2.0 @ Lisa 2014

Different algos for different

metric categories

Page 97: Rethinking metrics: metrics 2.0 @ Lisa 2014

Alert criticality & routing based

on tags
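
A sketch of what tag-based alert routing could look like. The rule table, the `env` tag and the destination names are all hypothetical; the talk only states the idea.

```python
# Hypothetical routing table: the first rule whose tags all match wins.
RULES = [
    ({"env": "prod", "unit": "Err/s"}, "page-oncall"),
    ({"env": "prod"}, "email-team"),
]


def route(tags, rules=RULES, default="log-only"):
    """Pick an alert destination from a metric's tags."""
    for required, destination in rules:
        if all(tags.get(k) == v for k, v in required.items()):
            return destination
    return default


print(route({"env": "prod", "unit": "Err/s", "service": "proxy-server"}))
```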

Page 98: Rethinking metrics: metrics 2.0 @ Lisa 2014

integrating logs & metrics

Page 101: Rethinking metrics: metrics 2.0 @ Lisa 2014

Algorithms leverage both

logs and metrics

Page 102: Rethinking metrics: metrics 2.0 @ Lisa 2014

Conclusion

structured, self-describing, standardized

metrics = enabler

Page 103: Rethinking metrics: metrics 2.0 @ Lisa 2014

Conclusion

Concerns? Ideas? Advice?

Ready for early adopters!

Work with me on next-gen telemetry!

Page 104: Rethinking metrics: metrics 2.0 @ Lisa 2014

Seen in this presentation:

metrics20.org

vimeo.github.io/graph-explorer

github.com/vimeo/timeserieswidget

github.com/vimeo/carbon-tagger

github.com/vimeo/statsdaemon

github.com/graphite-ng/carbon-relay-ng

github.com/Dieterbe/anthracite

Page 105: Rethinking metrics: metrics 2.0 @ Lisa 2014

You might also like:

github.com/vimeo/graphite-influxdb

github.com/vimeo/graphite-api-influxdb-docker

github.com/vimeo/whisper-to-influxdb

github.com/Dieterbe/influx-cli

github.com/graphite-ng/graphite-ng

github.com/vimeo/smoketcp

github.com/vimeo/tailgate

Page 106: Rethinking metrics: metrics 2.0 @ Lisa 2014

Stay in touch!

metrics20 Google group

it-telemetry Google group

twitter.com/[email protected]@vimeo.com

Lisa labs office hours after lunch

Q & A

Page 107: Rethinking metrics: metrics 2.0 @ Lisa 2014

Bonus round

Page 108: Rethinking metrics: metrics 2.0 @ Lisa 2014

Dashboard definition

queries = [

'cpu usage sum by core',

'mem unit=B !total group by type:swap',

'stack network unit=Mb/s',

'unit=B (free|used) group by =mountpoint'

]

Page 110: Rethinking metrics: metrics 2.0 @ Lisa 2014

Catchall plugins

stats.dfvimeocliapp2.twitter.error

{

"n1": "dfvimeocliapp2",

"n2": "twitter",

"n3": "error",

"plugin": "catchall_statsd",

"source": "statsd",

"target_type": "rate",

"unit": "unknown/s"

}
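
A minimal sketch of the catchall mapping shown above: a dotted name that no specific plugin recognizes gets positional tags n1, n2, ... plus generic metadata. The function is a hypothetical reconstruction from this one example, not the actual plugin code.

```python
def catchall_statsd(name):
    """Map an unrecognized 'stats.<a>.<b>...' name to positional n1, n2, ... tags."""
    nodes = name.split(".")[1:]  # drop the leading 'stats' prefix
    tags = {f"n{i}": node for i, node in enumerate(nodes, start=1)}
    tags.update(
        plugin="catchall_statsd",
        source="statsd",
        target_type="rate",
        unit="unknown/s",
    )
    return tags


print(catchall_statsd("stats.dfvimeocliapp2.twitter.error"))
```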

Page 111: Rethinking metrics: metrics 2.0 @ Lisa 2014

Equivalence

servers.host.cpu.total.iowait → "core": "_sum_"

servers.host.cpu.<core-number>.iowait

servers.host.loadavg.15