Rethinking metrics: metrics 2.0

  • date post

    11-Jul-2015
  • Category

    Engineering

Transcript of Rethinking metrics: metrics 2.0

  • rethinking metrics:

    metrics 2.0

  • by niteroi @ panoramio.com

  • vimeo.com/43800150

  • problems

    Metrics 2.0 concepts

    implementations, uses & ideas

  • terminology

    sync

  • (1234567890, 82)

    (1234567900, 123)

    (1234567910, 109)

    (1234567920, 77)

    db15.mysql.queries_running

    host=db15 mysql.queries_running

  • Problems

  • Vimeo.com page requests/s?

    server X write perf?

  • stats.hits.vimeo_com

    stats_counts.hits.vimeo_com

    stats.*.vimeo_requests

    collectd.db.disk.sda1.disk_time.write

  • Terminology? Meaning?

    Prefix?

    Unit?

    Aggregation?

    Source?

    Understanding metrics

  • Unclear, inconsistent terminology, format

    tightly coupled

    lack information

  • http://litlquest.com/forest-trees/see-forest-trees-2

  • O(S*P*A*C)

    S = # Sources

    P = # People

    A = # Aggregators

    C = # Complexity

  • Graphs and dashboards are a huge time sink.

  • metrics 2.0

    concepts

  • Self-describing

    Standardized

    Orthogonal dimensions

  • stats.timers.dfs5.proxy-server.object.GET.200.timing.upper_90

  • {

    server: dfvimeodfsproxy5,

    http_method: GET,

    http_code: 200,

    unit: ms,

    metric_type: gauge,

    stat: upper_90,

    swift_type: object

    }

  • allow more characters

    unit: Req/s, site: vimeo.com, ...

  • Metadata

    meta: {

    src: proxy.py:458,

    from: diamond

    }
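The tag set and the `meta` block above can be sketched as a small data structure. This is a minimal illustration, not the reference data model: the class name `Metric`, the method `identity`, and the choice to derive the identity from the sorted tags only (leaving `meta` out) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Metric:
    """Hypothetical container: identifying tags plus free-form metadata."""
    tags: Dict[str, str]
    meta: Dict[str, str] = field(default_factory=dict)

    def identity(self) -> str:
        # Assumption: only tags identify the series. Metadata such as the
        # source line can change without creating a new series.
        return " ".join(f"{k}={v}" for k, v in sorted(self.tags.items()))


m = Metric(
    tags={
        "server": "dfvimeodfsproxy5",
        "http_method": "GET",
        "http_code": "200",
        "unit": "ms",
        "metric_type": "gauge",
        "stat": "upper_90",
        "swift_type": "object",
    },
    meta={"src": "proxy.py:458", "from": "diamond"},
)
print(m.identity())
```

Because each dimension is its own key, a consumer can filter on `unit` or `http_code` without knowing where that value would sit in a dotted name.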

  • Datamodel

  • Any protocol

  • Source format

    service=foo instance=host unit=B 123 1234567890

    {s}foo.{i}host.{u}B 123 1234567890

    125 1234567890 # separate data
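A rough sketch of reading the first line format shown above (`key=value` tags, then a value, then a Unix timestamp). The function name `parse_line` is hypothetical, and the sketch deliberately does not handle the compact `{s}foo.{i}host.{u}B` form or the separate-data line.

```python
from typing import Dict, Tuple


def parse_line(line: str) -> Tuple[Dict[str, str], float, int]:
    """Split 'k=v k=v ... value timestamp' into (tags, value, timestamp)."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError(f"expected tags, value and timestamp: {line!r}")
    # The last two fields are the value and the timestamp; the rest are tags.
    *tag_fields, value, timestamp = fields
    tags: Dict[str, str] = {}
    for item in tag_fields:
        key, sep, val = item.partition("=")
        if not sep or not key or not val:
            raise ValueError(f"malformed tag {item!r}")
        tags[key] = val
    return tags, float(value), int(timestamp)


tags, value, ts = parse_line("service=foo instance=host unit=B 123 1234567890")
print(tags, value, ts)
```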

  • metrics20.org

  • SI + IEC

    B Err Warn Conn Job File Req ...

    MB/s Err/d Req/h ...

  • Immediate understanding of metrics

    Minimize time to graphs, alerting rules, debugging

    compatibility & flexibility in tooling

  • Implementation examples

  • Carbon-tagger

    stats.gauges.host.foo 125 1234567890

    service=foo instance=host target_type=gauge unit=B 123 1234567890

  • Statsdaemon

    unit=B

    unit=B

    ...

    unit=ms

    unit=ms

    ...

    unit=B/s

    unit=ms stat=mean

    unit=ms stat=upper_90

    ...

  • Keep metric tags in sync with data

  • Graphing & dashboarding

    Visualization

    Alerting

  • Graphing & Dashboarding

  • GraphExplorer

  • Graph-Explorer queries 101

    proxy-server swift server:regex unit=ms

    (AND)

  • upper_90 (or stat=upper_90)

    from to

    avg over (5M, 1h, 3d, ...)
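The matching part of such a query can be approximated in a few lines. This is a sketch under assumptions, not Graph-Explorer's implementation: `tag=value` is treated as an exact match, `tag:regex` as a regular-expression search on that tag's value, a bare word as a substring match against any `key=value` pair, and all terms are ANDed. The function name `matches` and the sample tags are hypothetical.

```python
import re
from typing import Dict


def matches(query: str, tags: Dict[str, str]) -> bool:
    """Return True if every whitespace-separated term matches the tag set."""
    for term in query.split():
        if "=" in term:
            key, _, wanted = term.partition("=")
            if tags.get(key) != wanted:
                return False
        elif ":" in term:
            key, _, pattern = term.partition(":")
            if key not in tags or not re.search(pattern, tags[key]):
                return False
        else:
            # Bare word: assumed to match anywhere in a key=value pair.
            if not any(term in f"{k}={v}" for k, v in tags.items()):
                return False
    return True


sample = {
    "service": "proxy-server",
    "what": "swift",
    "server": "dfs5",
    "unit": "ms",
    "stat": "upper_90",
}
print(matches("proxy-server swift server:dfs[0-9]+ unit=ms", sample))
```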

  • Compare object put/get

    stack

    http_method:(PUT|GET)

    swift_type=object

    avg by http_code,server
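One way to read `avg by http_code,server`: series that differ only in the listed tags are merged into a single averaged series. The sketch below assumes that reading, and also assumes all series are already aligned to the same timestamps and have equal length. `avg_by` and the `Series` alias are names made up for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Series = Tuple[Dict[str, str], List[float]]


def avg_by(series: List[Series], by: List[str]) -> List[Series]:
    """Average together series whose tags are equal once `by` tags are dropped."""
    groups = defaultdict(list)
    for tags, values in series:
        # The group key is every tag except the ones we aggregate over.
        key = tuple(sorted((k, v) for k, v in tags.items() if k not in by))
        groups[key].append(values)
    merged = []
    for key, members in groups.items():
        averaged = [sum(column) / len(column) for column in zip(*members)]
        merged.append((dict(key), averaged))
    return merged


data = [
    ({"http_method": "GET", "http_code": "200", "unit": "ms"}, [10.0, 20.0]),
    ({"http_method": "GET", "http_code": "404", "unit": "ms"}, [30.0, 40.0]),
    ({"http_method": "PUT", "http_code": "200", "unit": "ms"}, [5.0, 7.0]),
]
for tags, values in avg_by(data, ["http_code"]):
    print(tags, values)
```

Because `unit` stays in the group key, series with different units are never averaged together.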

  • Comparing servers

    http_method:(PUT|GET)

    group by unit,target_type

    avg by http_code,swift_type,http_method

  • transcode unit=Job/s

    avg over

    from to

  • Note: data is obfuscated

  • Bucketing

    sum by zone:eu-west|us-east|ap-southeast|us-west|sa-east|vimeo-df|vimeo-lv

    group by state
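Bucketing can be pictured as rewriting a tag value to the first bucket it matches and then summing the series that end up with identical tags. The sketch below assumes substring matching and assumes values that match no bucket are kept unchanged; neither detail is confirmed by the slides. `bucket_for`, `sum_by_buckets`, and the zone names in the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

Series = Tuple[Dict[str, str], List[float]]


def bucket_for(value: str, buckets: List[str]) -> str:
    """Return the first bucket contained in value, else value unchanged."""
    for bucket in buckets:
        if bucket in value:
            return bucket
    return value


def sum_by_buckets(series: List[Series], tag: str, buckets: List[str]) -> List[Series]:
    """Replace `tag` by its bucket, then sum series with identical tags."""
    groups: Dict[tuple, List[float]] = {}
    for tags, values in series:
        new_tags = dict(tags)
        new_tags[tag] = bucket_for(tags.get(tag, ""), buckets)
        key = tuple(sorted(new_tags.items()))
        if key in groups:
            groups[key] = [a + b for a, b in zip(groups[key], values)]
        else:
            groups[key] = list(values)
    return [(dict(key), values) for key, values in groups.items()]


jobs = [
    ({"zone": "eu-west-1a", "state": "running"}, [1.0, 2.0]),
    ({"zone": "eu-west-1b", "state": "running"}, [3.0, 4.0]),
    ({"zone": "us-east-1a", "state": "running"}, [5.0, 6.0]),
]
for tags, values in sum_by_buckets(jobs, "zone", ["eu-west", "us-east"]):
    print(tags, values)
```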

  • Note: data is obfuscated

  • Compare job states per region (zones bucket)

    group by zone

  • Note: data is obfuscated

  • Unit conversion

    unit=Mb/s network server:regex

    sum by server
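When the stored unit is known, a query can ask for a different prefix and the tool can rescale the data. A minimal sketch of that arithmetic, assuming `B` means byte, `b` means bit, and SI and IEC prefixes as usually defined; the helper names `split_unit` and `convert` are made up here, and only byte/bit units with an optional `/time` suffix are handled.

```python
SI = {"": 1, "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}
IEC = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
PREFIXES = {**SI, **IEC}
BITS_PER_BASE = {"b": 1, "B": 8}  # assumption: b = bit, B = byte


def split_unit(unit: str):
    """'Mb/s' -> (size of one unit in bits, '/s')."""
    head, sep, tail = unit.partition("/")
    if not head:
        raise ValueError(f"empty unit: {unit!r}")
    prefix, base = head[:-1], head[-1]
    if base not in BITS_PER_BASE or prefix not in PREFIXES:
        raise ValueError(f"unsupported unit: {unit!r}")
    return PREFIXES[prefix] * BITS_PER_BASE[base], sep + tail


def convert(value: float, src: str, dst: str) -> float:
    """Rescale value from src to dst; the '/time' part must be identical."""
    src_bits, src_rest = split_unit(src)
    dst_bits, dst_rest = split_unit(dst)
    if src_rest != dst_rest:
        raise ValueError("time basis differs; that needs deriving or integrating")
    return value * src_bits / dst_bits


print(convert(125000, "B/s", "Mb/s"))
```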

  • Integration

    Metric unit=B/s Query unit=TB

  • Deriving

    Metric unit=B Query unit=GB/d
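Integration and deriving are the two directions of the same relationship: a rate (`B/s`) accumulates into a total (`B`), and a total differentiates back into a rate. The sketch below uses a simple rectangle rule and plain lists of `(timestamp, value)` points; the function names and the `per_seconds` parameter are assumptions for illustration, and prefix scaling (to `TB` or `GB`) is left out.

```python
from typing import List, Tuple

Points = List[Tuple[int, float]]


def integrate(points: Points) -> Points:
    """Turn a per-second rate into a running total (rectangle rule)."""
    total = 0.0
    out: Points = []
    prev_ts = None
    for ts, rate in points:
        if prev_ts is not None:
            # Apply the current rate to the interval that just ended.
            total += rate * (ts - prev_ts)
        out.append((ts, total))
        prev_ts = ts
    return out


def derive(points: Points, per_seconds: int = 1) -> Points:
    """Turn a running total into a rate per `per_seconds` seconds."""
    out: Points = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        out.append((t1, (v1 - v0) / (t1 - t0) * per_seconds))
    return out


rate = [(0, 100.0), (10, 100.0), (20, 200.0)]   # unit=B/s
total = integrate(rate)                          # unit=B
print(total)
print(derive(total))                             # back to B/s
print(derive(total, per_seconds=86400))          # B/d
```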

  • Future work

    Facet-based suggestions

    Custom trees

  • Dashboard definition

    queries = [ 'cpu usage sum by core',

    'mem unit=B !total group by type:swap',

    'stack network unit=Mb/s',

    'unit=B (free|used) group by =mountpoint'

    ]

  • Equivalence

    servers.host.cpu.total.iowait core: _sum_

    servers.host.cpu..iowait

    servers.host.loadavg.15

  • Future Work

  • Storage aggregation rules

    graphite API functions such as cumulative, summarize and smartSummarize

    consolidateBy & Graph renderers

  • Self-describing & standardized

    stat=upper/lower/mean/...

    target_type=counter..

  • Visualizations

  • From: dygraphs.com

  • Select your view

  • bin=10

    bin=20

    bin=30

    bin=40

    bin=50

    bin=100

  • Alerting

  • unit=Err/s

  • Automatic cause & effect

  • Different algo's for different

    things

  • Alert criticality & routing based

    on tags
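As a thought experiment, tag-based routing could look like a first-match rule table. Everything in this sketch is hypothetical: the rule format, the `env` tag, the severity names, and the destinations are invented for illustration and do not describe an existing tool.

```python
from typing import Dict, List, Tuple

Rule = Tuple[Dict[str, str], Tuple[str, str]]

# Hypothetical rules: (required tags, (criticality, destination)).
RULES: List[Rule] = [
    ({"env": "prod", "unit": "Err/s"}, ("critical", "pager")),
    ({"unit": "Err/s"}, ("warning", "email")),
]
DEFAULT = ("info", "dashboard")


def route(tags: Dict[str, str], rules: List[Rule] = RULES) -> Tuple[str, str]:
    """Return the outcome of the first rule whose tags all match."""
    for required, outcome in rules:
        if all(tags.get(key) == value for key, value in required.items()):
            return outcome
    return DEFAULT


print(route({"env": "prod", "unit": "Err/s", "service": "transcode"}))
```

Rules are ordered from most to least specific, so a production error rate pages someone while the same metric elsewhere only sends mail.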

  • integrating logs & metrics

  • Algorithms leverage both

    logs and metrics

  • Changing software

  • Conclusion

    structured, self-describing, standardized metrics = enabler

  • Conclusion

    What are your concerns? Ideas?

    Let's make this better

    Ready for early adopters!

    Work with me on next-gen telemetry!

    Tips on coordinating spec development?

    How does FB/G/AMZ/MS/APL/... do this stuff?

  • Seen in this presentation:

    metrics20.org

    vimeo.github.io/graph-explorer

    github.com/vimeo/timeserieswidget

    github.com/vimeo/carbon-tagger

    github.com/vimeo/statsdaemon

    github.com/graphite-ng/carbon-relay-ng

    github.com/Dieterbe/anthracite

  • You might also like:

    github.com/vimeo/graphite-influxdb

    github.com/vimeo/graphite-api-influxdb-docker

    Github.com/vimeo/whisper-to-influxdb

    github.com/Dieterbe/influx-cli

    github.com/graphite-ng/graphite-ng

    Github.com/vimeo/smoketcp

    Github.com/vimeo/tailgate

  • Stay in touch!

    groups.google.com/forum/#!forum/metrics20

    groups.google.com/forum/#!forum/it-telemetry

    twitter.com/Dieter_be

    dieter.plaetinck.be

  • Q&A
