Monitoring Storm
Meetup transcript · files.meetup.com/5809742/storm monitoring.pdf · 2013-06-13
Why Monitor
• Measure performance – tune Storm settings
• Troubleshoot
• Anticipate problems – Alerts
• Historical problem investigation
• Application Metrics
What to Monitor: Performance
• Per-topology
– Throughput
– Latency
• Per-bolt and per-spout
– Throughput
– Latency
– Queuing time
– Execution time
• Per-source and per-sink (store)
– Read/write times
– Batch sizes
What to Monitor: Troubleshooting
• Is there a drop in throughput?
• Track down latency increases
• Debugging
• Detect JVM issues (memory, GC)
• Detect Hardware issues
• Are any stores having problems?
• Drill down into specific workers / bolts
How to monitor: Tools
• Storm UI
• JMX / VisualVM
• Yammer Metrics
– collect metrics within a single JVM
• Graphite
– collect and graph metrics
• Log4j
– configurable logs
• Nagios
– monitor hardware, logs
How to Monitor: Storm UI
• Metrics aggregated by the Nimbus
• Available in fixed time intervals
• Counters rather than rates
• Thrift from the Nimbus
• Spout latency vs. processing latency to find queuing.
• Not persistent (topology redeploy clears stats)
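Because the UI exposes cumulative counters rather than rates, a throughput rate has to be derived by sampling a counter twice and dividing by the interval. A minimal sketch (names are illustrative, not Storm API):

```java
// Sketch: deriving a throughput rate from two cumulative counter samples,
// as you would when scraping stats from the Storm UI / Nimbus.
public class CounterToRate {
    /** Tuples per second between two counter samples taken intervalMs apart. */
    static double rate(long earlierCount, long laterCount, long intervalMs) {
        return (laterCount - earlierCount) * 1000.0 / intervalMs;
    }

    public static void main(String[] args) {
        // e.g. the "executed" counter read 10_000 then 13_000, 60 s apart
        System.out.println(rate(10_000, 13_000, 60_000)); // 50.0 tuples/s
    }
}
```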
How to Monitor: Metrics
• Yammer Metrics aggregates on each worker
– Metrics
• Counters
• Meters
• Timers
– Aggregations
• Mean vs. Median
• Percentiles (50%, 75%, 95%, 99%)
• Max
– Graphite Reporter
• Customize which metrics to send
• Customize interval
• Graphite aggregates across workers
• Hierarchical metric naming
– storm-prod.storm01.worker-6703.validator_bolt.3.time_in_queue.median
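To see why the slide stresses median and percentiles over the mean, here is a plain-Java sketch (not the Yammer Metrics API) of the aggregations a Timer would report. A single slow outlier skews the mean while the median still reflects the typical tuple:

```java
import java.util.Arrays;

// Sketch of Timer-style aggregations: mean, median, and percentiles.
// Shows why median/percentiles describe latency better than the mean.
public class LatencyStats {
    static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    /** Nearest-rank percentile over a sorted copy of the samples. */
    static double percentile(double[] xs, double p) {
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }

    public static void main(String[] args) {
        // Nine fast tuples and one slow outlier (ms)
        double[] latencies = {5, 5, 5, 5, 5, 5, 5, 5, 5, 500};
        System.out.println(mean(latencies));            // 54.5 -- skewed by the outlier
        System.out.println(percentile(latencies, 50));  // 5.0  -- the typical tuple
        System.out.println(percentile(latencies, 95));  // 500.0 -- tail latency
    }
}
```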
How to Monitor: Tracing
• Tuples are timestamped at the edge nodes
• Small trace object added to every tuple
• Enable/disable per topology
• Tracks departure/arrival times in spouts/bolts
• Can measure time spent queued up before a bolt
• Track end-to-end latency
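A hypothetical sketch of the per-tuple trace object described above: stamp the tuple at the edge, record arrival/execution times per component, and derive queue time and end-to-end latency. All names here are illustrative, not the actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative trace object carried on each tuple (assumed design, not Storm API).
public class TupleTrace {
    static class Hop {
        final String component;
        final long arrivedMs;   // when the tuple entered the component's queue
        final long executedMs;  // when execute() actually started
        Hop(String component, long arrivedMs, long executedMs) {
            this.component = component;
            this.arrivedMs = arrivedMs;
            this.executedMs = executedMs;
        }
        long queueTimeMs() { return executedMs - arrivedMs; }
    }

    final long emittedAtMs;           // stamped at the edge node / spout
    final List<Hop> hops = new ArrayList<>();

    TupleTrace(long emittedAtMs) { this.emittedAtMs = emittedAtMs; }

    long endToEndMs(long completedAtMs) { return completedAtMs - emittedAtMs; }

    public static void main(String[] args) {
        TupleTrace t = new TupleTrace(1_000);
        t.hops.add(new Hop("validator_bolt", 1_010, 1_050)); // queued 40 ms
        System.out.println(t.hops.get(0).queueTimeMs()); // 40
        System.out.println(t.endToEndMs(1_200));         // 200
    }
}
```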
How to Monitor: JMX
• Troubleshoot specific workers
• Thread dumps / Heap Dumps
• Enable JMX (storm.yaml)
– worker.childopts: "-Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.local.only=false -Dcom.sun.management.jmxremote.port=1%ID%"
• Connect to a worker with VisualVM
• JVM libraries include JMX MBeans
• Can run certain operations
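The heap and GC numbers VisualVM shows over JMX come from the JVM's platform MBeans, which are also readable in-process via the standard library, e.g. for a quick ad-hoc check inside a worker:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Reading the same memory/GC MBeans that JMX/VisualVM expose remotely.
public class JvmStats {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        System.out.println("heap used (bytes): " + mem.getHeapMemoryUsage().getUsed());

        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": " + gc.getCollectionCount() + " collections, "
                    + gc.getCollectionTime() + " ms total");
        }
    }
}
```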
How to Monitor: Logs
• Storm uses Log4J – switching to Logback in 0.9
• /opt/storm/log4j/storm.log.properties
– have to change on each node
• MDC for bolt context
• Nagios monitors worker logs for ERROR
• Log invalid tuples – prevent infinite loops
• Enable supervisor/worker logging to catch restarts
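For the MDC point above, a Log4j properties fragment along these lines would surface per-bolt context (set via `MDC.put`) in every log line; the appender name and MDC key `bolt` are illustrative, not taken from the deck:

```properties
# PatternLayout printing the MDC key "bolt" (populated via MDC.put("bolt", ...))
log4j.rootLogger=INFO, A1
log4j.appender.A1=org.apache.log4j.ConsoleAppender
log4j.appender.A1.layout=org.apache.log4j.PatternLayout
log4j.appender.A1.layout.ConversionPattern=%d [%t] %-5p %X{bolt} %c - %m%n
```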
Examples
• Bottlenecks in Geo service lookup – tuning bolt parallelism
• Writing to Vertica (found bugs)
• Slowdown in Redis writes (found failing disk)
• Bad RAID controller
New Metrics in 0.8.2
• Process latency – time until a tuple is acked
• Execute latency – time spent in execute() for a tuple
• Capacity – the fraction of the last 10 minutes the bolt spent executing tuples
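The capacity number can be understood as roughly (tuples executed × execute latency) / window length; a value near 1.0 suggests the bolt is saturated and needs more parallelism. A sketch of that arithmetic (an interpretation of the metric, not Storm's code):

```java
// Sketch of how the 0.8.2 "capacity" figure is derived: the approximate
// fraction of a time window a bolt spent inside execute().
public class Capacity {
    static double capacity(long executed, double executeLatencyMs, long windowMs) {
        return executed * executeLatencyMs / windowMs;
    }

    public static void main(String[] args) {
        // 120_000 tuples at 4 ms each over a 10-minute (600_000 ms) window
        System.out.println(capacity(120_000, 4.0, 600_000)); // 0.8 -> ~80% busy
    }
}
```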
Other Options
• Ooyala metrics_storm – very similar to our approach
– https://github.com/ooyala/metrics_storm
• Storm 0.9 Metrics
– Collect arbitrary, custom metrics over fixed windows
– Metrics aggregated within Storm
– Can define custom MetricsConsumer