
System Monitoring Framework for Shasta

planger@cray.com

CUG 2019

© 2019 Cray Inc.

TOPICS

• Overview of the system monitoring framework
• Subsystems contributing metrics
• Correlating data with visualization tools
• Summary
• Q&A

Overview

SYSTEM MONITORING FRAMEWORK

• What is the System Monitoring Framework?
  • A tightly integrated framework for collecting and persisting metrics and logs
  • Consolidates telemetry data from multiple subsystems
    - Switch fabric
    - Network
    - Job Management
    - Storage
    - Power
    - User Applications
    - Compute
  • Integrated alarm and notification framework with threshold engine
  • Standard visualization tools for graphing metrics and searching logs
  • RESTful API for integration into customers' monitoring solutions
  • Integrated with the diagnosability and serviceability solutions

ARCHITECTURE AND DATA SOURCES

Data Sources

SUBSYSTEMS CONTRIBUTING METRICS

• Shasta hardware
• ClusterStor storage
• Compute nodes
• Network and fabric
• Logs

HARDWARE MANAGEMENT METRICS

• Collect metrics from
  • Chassis controllers
  • Node controllers
  • Blade switch controllers
  • PDUs
  • TOR switches
• Collected using the industry-standard Redfish API
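As a rough illustration of the Redfish collection mentioned on this slide, the sketch below pulls temperature readings out of a Thermal resource. The `Temperatures`, `Name`, and `ReadingCelsius` properties come from the DMTF Redfish Thermal schema; the sample payload, the chassis id, the sensor names, and the `extract_temperatures` helper are assumptions made for this example and are not taken from the Shasta software.

```python
import json

# Illustrative payload shaped like a Redfish Thermal resource.
# The chassis id and sensor names are made up for this example.
SAMPLE_THERMAL = json.loads("""
{
  "@odata.id": "/redfish/v1/Chassis/example0/Thermal",
  "Temperatures": [
    {"Name": "CPU0 Temp", "ReadingCelsius": 54.0},
    {"Name": "Inlet Temp", "ReadingCelsius": 23.5},
    {"Name": "Absent Sensor", "ReadingCelsius": null}
  ]
}
""")


def extract_temperatures(thermal):
    """Return {sensor name: reading in Celsius}, skipping sensors with no reading."""
    readings = {}
    for sensor in thermal.get("Temperatures", []):
        value = sensor.get("ReadingCelsius")
        if value is not None:
            readings[sensor["Name"]] = value
    return readings


print(extract_temperatures(SAMPLE_THERMAL))
```

In a real deployment the payload would come from an authenticated HTTPS request to a controller; that part is omitted here because the endpoints and credentials are site-specific.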

CLUSTERSTOR STORAGE METRICS

• Metrics collected
  • Lustre performance
    • Metadata, OST I/O read/write
  • Lustre jobstats
  • Logs and events
• Collection rate: 15 to 30 seconds
  • Calculated into delta rates and persisted
  • Enables trend analysis
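The "delta rates" bullet on this slide can be illustrated with a small sketch: cumulative counters sampled at intervals are converted into per-second rates before being stored. The function name, the sample values, and the policy of skipping an interval when a counter goes backwards are assumptions for illustration, not the framework's actual behavior.

```python
def delta_rates(samples):
    """Convert (timestamp_seconds, cumulative_count) samples into per-second rates.

    An interval where the counter goes backwards is treated as a counter
    reset and skipped, as is any interval with non-positive elapsed time.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or c1 < c0:
            continue
        rates.append((t1, (c1 - c0) / elapsed))
    return rates


# Hypothetical cumulative write-byte counter sampled every 15 seconds;
# the last sample simulates a counter reset.
samples = [(0, 1_000), (15, 16_000), (30, 46_000), (45, 500)]
print(delta_rates(samples))  # [(15, 1000.0), (30, 2000.0)]
```

Storing rates rather than raw counters is what makes samples taken at different intervals (15 versus 30 seconds) directly comparable in trend graphs.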

COMPUTE NODE METRICS VIA LDMS

COMPUTE NODE METRICS

• Six main categories: I/O, System, CPU, Swap, Processes & Memory
• Total of 13 metrics sampled at a 10-second interval

NETWORK/FABRIC METRICS

• Metrics are collected to enable monitoring and diagnosis of performance and congestion of the fabric
• These metrics will include:
  • Critical asynchronous link events and port state changes
    • e.g. used for diagnosis of link/cable issues
  • Running state data based on a configured set of standard SNMP MIBs
    • RFCs 1213, 2819, 2863, 3635, 4188, 4293
    • Data periodically posted, period is configurable
• Types of bandwidth and congestion metrics collected
  • Packets/bytes in/out
  • Unicast/Multicast/Broadcast
  • Drops/errors
  • Pause Frames in/out
    • e.g. excessive transmit pause frames used to identify error at endpoint device
• All telemetry data includes locality of metric
  • Provides ability for focused query/heat map generation on specific area of the fabric
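The pause-frame and locality bullets on this slide can be combined in a short sketch: ports whose transmit pause-frame count exceeds a threshold are grouped by location, which is the kind of result a focused query or heat map would start from. The record fields, location strings, and threshold value are invented for this example and do not reflect the framework's schema.

```python
from collections import defaultdict

# Hypothetical per-port samples; field names and locations are assumptions.
SAMPLES = [
    {"location": "cabinet0/switch1", "port": 3, "tx_pause_frames": 1200},
    {"location": "cabinet0/switch1", "port": 4, "tx_pause_frames": 10},
    {"location": "cabinet1/switch0", "port": 1, "tx_pause_frames": 5},
]


def pause_hotspots(samples, threshold):
    """Group ports whose transmit pause-frame count exceeds threshold by location."""
    hotspots = defaultdict(list)
    for sample in samples:
        if sample["tx_pause_frames"] > threshold:
            hotspots[sample["location"]].append(sample["port"])
    return dict(hotspots)


print(pause_hotspots(SAMPLES, threshold=1000))  # {'cabinet0/switch1': [3]}
```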

Log Aggregation

LOG AGGREGATION

[Diagram. Components shown: Container Service (with Logging Sidecar), Rsyslog Collector, Base-OS syslog, ClusterStor Logs, Rsyslog Aggregator, Kafka Bus, Logstash, ElasticSearch, Other Persistent Store, Kibana, Other GUI, Telemetry API]
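Several components on this slide handle syslog traffic, so a small example of one step any syslog consumer performs may help: decoding the leading `<PRI>` field, which the syslog RFCs define as facility × 8 + severity. The `decode_pri` helper and the sample line are assumptions for illustration; the slide does not describe how the framework itself parses messages.

```python
import re

# Severity keywords in numeric order 0..7, as defined by the syslog RFCs.
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]
PRI_RE = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)


def decode_pri(line):
    """Split a syslog line into facility number, severity name, and the remainder."""
    match = PRI_RE.match(line)
    if match is None:
        raise ValueError("line does not start with a <PRI> field")
    pri = int(match.group(1))
    if pri > 191:  # 23 facilities * 8 + 7 is the largest valid value
        raise ValueError("PRI value out of range")
    facility, severity = divmod(pri, 8)
    return {
        "facility": facility,
        "severity": SEVERITIES[severity],
        "message": match.group(2),
    }


# Made-up log line: PRI 134 = facility 16 (local0) * 8 + severity 6 (info).
print(decode_pri("<134>Jan 11 10:00:00 nid000001 app: started"))
```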

Integration with 3rd Party Monitoring System

TELEMETRY API

[Diagram. Labels shown: Data Sources (Compute, Network, Jobs, Power, Storage); Kafka; Telemetry API; Shasta Monitoring Framework; API Clients 1, 2, ... N; Kafka Clients 1, 2, ... N; Customer]
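The slide does not give the Telemetry API's endpoints or message format, so the following is only a sketch of what a customer-side client might do once it has a stream of records in hand. It assumes, purely for illustration, that records arrive as newline-delimited JSON with a `subsystem` field; the field names, the `iter_records` helper, and the in-memory stream standing in for a network response are all hypothetical.

```python
import io
import json


def iter_records(stream, subsystem=None):
    """Yield JSON records from a line-oriented stream, optionally filtered by subsystem."""
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank keep-alive lines
        record = json.loads(line)
        if subsystem is None or record.get("subsystem") == subsystem:
            yield record


# In-memory stand-in for a streaming response; the record layout is an assumption.
fake_stream = io.StringIO(
    '{"subsystem": "power", "value": 410}\n'
    "\n"
    '{"subsystem": "storage", "value": 7}\n'
)
print(list(iter_records(fake_stream, subsystem="power")))
```

Using a generator keeps memory use flat for a long-lived stream, since records are handed to the caller one at a time.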

USE CASE DEMO

SUMMARY

• The System Monitoring Framework aggregates metrics into a single framework
• Telemetry collected includes:
  - Shasta hardware
  - Storage Lustre and job metrics
  - Logs
  - VMStats from compute nodes
  - Network and fabric metrics
• Tools are provided to enable trend analysis, searching, and correlating of data
• A REST API is provided to allow streaming of telemetry off the Kafka bus into customer monitoring solutions

SAFE HARBOR STATEMENT

This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets and other statements that are not historical facts.

These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements.

THANK YOU

QUESTIONS?