Monitoring Storm

Visible Measures June 2013 Ilya Kaplun [email protected]

Agenda

•  Why Monitor

•  What to Monitor

•  How to Monitor

•  Conclusion

Why Monitor

•  Measure performance
–  tune Storm settings

•  Troubleshoot

•  Anticipate problems
–  Alerts

•  Historical problem investigation

•  Application Metrics

What to Monitor: Performance

•  Per-topology

–  Throughput

–  Latency

•  Per-bolt and per-spout

–  Throughput

–  Latency

–  Queuing time

–  Execution time

•  Per-source and per-sink (store)

–  Read/write times

–  Batch sizes

What to Monitor: Troubleshooting

•  Is there a drop in throughput?

•  Track down latency increases

•  Debugging

•  Detect JVM issues (memory, GC)

•  Detect Hardware issues

•  Are any stores having problems?

•  Drill down into specific workers / bolts

How to Monitor: Tools

•  Storm UI

•  JMX / VisualVM

•  Yammer Metrics

–  collect metrics within a single JVM

•  Graphite

–  collect and graph metrics

•  Log4j

–  configurable logs

•  Nagios

–  monitor hardware, logs

How to Monitor: Storm UI

•  Metrics aggregated by the Nimbus

•  Available in fixed time intervals

•  Counters rather than rates

•  Data available via Thrift from the Nimbus

•  Spout latency vs. processing latency to find queuing.

•  Not persistent (topology redeploy clears stats)
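
Since the UI's numbers come from Nimbus over Thrift, the same data can be pulled programmatically. A minimal sketch using Storm's bundled Thrift client (pre-Apache backtype.storm package names from the 0.8/0.9 era; error handling omitted):

import java.util.Map;

import backtype.storm.generated.ClusterSummary;
import backtype.storm.generated.TopologySummary;
import backtype.storm.utils.NimbusClient;
import backtype.storm.utils.Utils;

public class NimbusStats {
    public static void main(String[] args) throws Exception {
        // Read defaults.yaml + storm.yaml (nimbus.host, nimbus.thrift.port).
        Map conf = Utils.readStormConfig();
        NimbusClient nimbus = NimbusClient.getConfiguredClient(conf);
        // Same ClusterSummary the Storm UI renders.
        ClusterSummary cluster = nimbus.getClient().getClusterInfo();
        for (TopologySummary t : cluster.get_topologies()) {
            // These are counters, not rates: diff two samples to get throughput.
            System.out.println(t.get_name()
                    + " status=" + t.get_status()
                    + " workers=" + t.get_num_workers()
                    + " uptime=" + t.get_uptime_secs() + "s");
        }
    }
}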

How to Monitor: Metrics

•  Yammer Metrics aggregates on each worker

–  Metrics
   •  Counters
   •  Meters
   •  Timers

–  Aggregations
   •  Mean vs. Median
   •  Percentiles (50%, 75%, 95%, 99%)
   •  Max

–  Graphite Reporter
   •  Customize which metrics to send
   •  Customize interval

•  Graphite aggregates across workers

•  Hierarchical Metric Naming

–  storm-prod.storm01.worker-6703.validator_bolt.3.time_in_queue.median
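
A minimal sketch of how a bolt might wire this up with the Yammer Metrics 2.x API and its Graphite reporter. The metric names, the Graphite host/port, the 10-second interval, and the idea of reading a timestamp field off the tuple are illustrative assumptions, not code from the deck:

import java.util.Map;
import java.util.concurrent.TimeUnit;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

import com.yammer.metrics.Metrics;
import com.yammer.metrics.core.Counter;
import com.yammer.metrics.core.Timer;
import com.yammer.metrics.reporting.GraphiteReporter;

public class ValidatorBolt extends BaseRichBolt {
    private transient Counter validated;
    private transient Timer timeInQueue;
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // Counters/meters/timers live in the worker JVM's metrics registry.
        validated   = Metrics.newCounter(ValidatorBolt.class, "validated");
        timeInQueue = Metrics.newTimer(ValidatorBolt.class, "time_in_queue",
                TimeUnit.MILLISECONDS, TimeUnit.SECONDS);
        // One reporter per worker pushes snapshots (mean, median, percentiles, max)
        // to Graphite every 10 seconds under a hierarchical prefix; in practice
        // this should be enabled once per worker, not once per task.
        GraphiteReporter.enable(10, TimeUnit.SECONDS, "graphite.example.com", 2003,
                "storm-prod.storm01.worker-" + context.getThisWorkerPort());
    }

    @Override
    public void execute(Tuple tuple) {
        // Queuing time: now minus a timestamp stamped on the tuple upstream
        // (assumes the emitting component added a "ts" field).
        long enqueuedAt = tuple.getLongByField("ts");
        timeInQueue.update(System.currentTimeMillis() - enqueuedAt, TimeUnit.MILLISECONDS);
        validated.inc();
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {}
}

Graphite then aggregates the per-worker series, and hierarchical names like the one above make it easy to drill into a single bolt instance.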

How to Monitor: Tracing

•  Tuples are timestamped at the edge nodes

•  Small trace object added to every tuple

•  Enable/disable per topology

•  Tracks departure/arrival times in spouts/bolts

•  Can measure time spent queued up before a bolt

•  Track end-to-end latency
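
The deck describes the idea but not the implementation; a hypothetical sketch of what such a trace object could look like (all names and the choice of millisecond timestamps are assumptions):

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical trace object carried as an extra field on every tuple while
// tracing is enabled for a topology; not the actual Visible Measures code.
public class TupleTrace implements Serializable {
    private final long createdAt = System.currentTimeMillis(); // stamped at the edge node
    private final List<Long> departures = new ArrayList<Long>();
    private final List<Long> arrivals   = new ArrayList<Long>();

    // Called just before a spout/bolt emits the tuple downstream.
    public void recordDeparture() { departures.add(System.currentTimeMillis()); }

    // Called at the top of a bolt's execute() when the tuple arrives.
    public void recordArrival()   { arrivals.add(System.currentTimeMillis()); }

    // Time spent queued up before the current bolt:
    // latest arrival minus the matching upstream departure.
    public long lastQueueMillis() {
        if (arrivals.isEmpty() || departures.size() < arrivals.size()) return 0L;
        int i = arrivals.size() - 1;
        return arrivals.get(i) - departures.get(i);
    }

    // End-to-end latency relative to the edge-node timestamp.
    public long endToEndMillis() { return System.currentTimeMillis() - createdAt; }
}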

How to Monitor: JMX

•  Troubleshoot specific workers

•  Thread dumps / Heap Dumps

•  Enable JMX (storm.yaml)

–  worker.childopts: "-Dcom.sun.management.jmxremote
      -Dcom.sun.management.jmxremote.ssl=false
      -Dcom.sun.management.jmxremote.authenticate=false
      -Dcom.sun.management.jmxremote.local.only=false
      -Dcom.sun.management.jmxremote.port=1%ID%"

•  Connect to a worker with VisualVM

•  JVM libraries include JMX MBeans

•  Can run certain operations

How to Monitor: Logs

•  Storm uses Log4J
–  switching to Logback in 0.9

•  /opt/storm/log4j/storm.log.properties

–  have to change on each node

•  MDC for bolt context

•  Nagios monitors worker logs for ERROR

•  Log invalid tuples
–  prevent infinite loops

•  Enable supervisor/worker logging to catch restarts
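
A small sketch of what the MDC and invalid-tuple points could look like with Log4j 1.x; the MDC keys and the pattern layout are illustrative, not taken from the deck:

import org.apache.log4j.Logger;
import org.apache.log4j.MDC;

// With a pattern such as "%d %-5p [%X{component}/%X{task}] %m%n" in the
// log4j properties file, every line carries the bolt/task that wrote it.
public class BoltLogging {
    private static final Logger LOG = Logger.getLogger(BoltLogging.class);

    // Call once from a bolt's prepare(): bind its identity into the MDC
    // so all subsequent log lines from this task include that context.
    public static void bindContext(String componentId, int taskId) {
        MDC.put("component", componentId);
        MDC.put("task", Integer.valueOf(taskId));
    }

    // Log and drop an invalid tuple instead of failing or re-emitting it,
    // so one bad record cannot cycle through the topology forever.
    public static void logInvalidTuple(Object tuple, Exception cause) {
        LOG.error("Dropping invalid tuple: " + tuple, cause);
    }
}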

Examples

•  Bottlenecks in Geo service lookup
–  tuning bolt parallelism

•  Writing to Vertica (found bugs)

•  Slowdown in Redis writes (found failing disk)

•  Bad RAID controller

New Metrics in 0.8.2

•  Process latency
–  time until tuple is acked

•  execute latency
–  time spent in execute for a tuple

•  capacity
–  what % of the time in the last 10 minutes the bolt spent executing tuples
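
One way to make the capacity figure concrete (this formula is not stated in the deck, but matches how the Storm UI derives it): capacity ≈ executed count × average execute latency / window length. For example, 50,000 tuples executed in a 10-minute (600,000 ms) window at an average execute latency of 6 ms gives 50,000 × 6 / 600,000 = 0.5, i.e. the bolt was busy roughly 50% of the time; values near 1.0 suggest the bolt needs more parallelism.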

Other Options

•  Ooyala metrics_storm – very similar to our approach
–  https://github.com/ooyala/metrics_storm

•  Storm 0.9 Metrics
–  Collect arbitrary, custom metrics over fixed windows

–  Metrics aggregated within Storm

–  Can define custom MetricsConsumer
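
A minimal sketch of that API, assuming the pre-Apache backtype.storm package names used in 0.9; the metric name, the 60-second bucket, and the choice of the built-in LoggingMetricsConsumer are illustrative:

import java.util.Map;

import backtype.storm.Config;
import backtype.storm.metric.LoggingMetricsConsumer;
import backtype.storm.metric.api.CountMetric;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class CountingBolt extends BaseRichBolt {
    private transient CountMetric processed;
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        // Values are aggregated within Storm over a fixed 60-second window,
        // then handed to whichever metrics consumer the topology registered.
        processed = context.registerMetric("processed", new CountMetric(), 60);
    }

    @Override
    public void execute(Tuple tuple) {
        processed.incr();
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {}

    // Topology-side registration: the built-in LoggingMetricsConsumer writes
    // the aggregated data points to a log; a custom IMetricsConsumer could
    // forward them to Graphite instead.
    public static void addMetricsConsumer(Config conf) {
        conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);
    }
}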

Questions?