IT Monitoring WG IT/CS Monitoring System
CERN IT Department
CH-1211 Genève 23
Switzerland
www.cern.ch/it
IT Monitoring WG
IT/CS Monitoring System
Virginie Longo, September 14th 2011
Summary
CS Monitoring Systems
• Spectrum CA
• Performance Analysis
• Other Tools
Data storage
Requirements
• NMS Status
• Requirements
• Research
CS Monitoring systems
Spectrum CA
Description:
• Commercial tool
• Fault-management-oriented system
• Root cause analysis / alarm correlation
• Topology view
• Service Manager => relation with SLS view
• Basic performance manager
Volumes:
• ~3000 devices monitored
• Supports 3K Laser devices for simple alarms (UP/DOWN)
• Thousands of attributes polled and analyzed
• 6 GB of event data over 30 days
Monitoring protocols:
• SNMP and ICMP
  => Information fed only by SNMP (no remote agent)
• A few others supported: DNS / DHCP / TRACEROUTE / NTP / HTTP
• A few homemade scripts for DHCP and web monitoring
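The root cause analysis / alarm correlation listed above can be illustrated with a minimal sketch. This is not how Spectrum implements it: the sketch assumes a hypothetical tree topology in which every device has a single upstream parent, and it treats a down device as a symptom whenever another device upstream of it is also down. The device names are made up.

```python
def root_causes(parent, down):
    """Return the down devices that have no other down device upstream.

    parent: dict mapping device -> upstream device (None for the root).
            Assumed to be a tree (no cycles); hypothetical topology.
    down:   set of devices that currently fail polling.
    """
    causes = set()
    for device in down:
        upstream = parent.get(device)
        # Walk towards the root; stop early if an upstream device is also down.
        while upstream is not None and upstream not in down:
            upstream = parent.get(upstream)
        if upstream is None:
            # Reached the root without meeting a down device: this is a root cause.
            causes.add(device)
    return causes


# Hypothetical topology: core -> dist1 -> access1, access2
topology = {"core": None, "dist1": "core", "access1": "dist1", "access2": "dist1"}
alarms = root_causes(topology, down={"dist1", "access1", "access2"})
print(sorted(alarms))  # ['dist1']
```

With this kind of filtering, an operator sees one alarm for the distribution switch instead of one per unreachable access switch.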
Spectrum Architecture (Storage system)
[Diagram: alarm monitoring and storage. Labels: SNMP; Spectrum DB (models, topology, current polling values, alarms); SSLogger; MySQL (events); remote MySQL (Service Manager); Alarm Notifier; Oracle (stats, CSR); Oracle (alarm history, LANDB); SLS; devices info. The diagram distinguishes Spectrum systems from non-Spectrum systems.]
Performance Analysis
Statistics architecture:
- Mix of homemade system and Spectrum tool
- Data extracted from Spectrum into Oracle DB
- Data consolidated into RRD
- Displayed on Netstat website (PHP)
Volumes:
- ~9000 models (ports + devices) for 24K RRDs
- 36 metrics
- 157 attributes
- ~160K entries loaded into Oracle DB per 5-minute poll
- Data kept 1 month in Oracle
- 2 years of consolidated data in RRDs
Note: a metric is a group of attributes, e.g. Bandwidth = in/out bits and in/out packets.
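The consolidation step can be sketched as follows: raw 5-minute samples are averaged into coarser time buckets, which is the idea behind an averaging archive in RRD. This is a simplified illustration in plain Python, not the actual extraction scripts; the bucket size and the sample values are assumptions.

```python
from collections import defaultdict


def consolidate(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns a list of (bucket_start, mean_value) sorted by bucket start.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align each sample to the start of the bucket it falls into.
        bucket_start = timestamp - (timestamp % bucket_seconds)
        buckets[bucket_start].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Four hypothetical 5-minute samples of inbound bits/s, consolidated to 10-minute averages.
raw = [(0, 100.0), (300, 200.0), (600, 400.0), (900, 600.0)]
print(consolidate(raw, bucket_seconds=600))  # [(0, 150.0), (600, 500.0)]
```

Keeping only consolidated averages is what makes a two-year retention affordable compared with the one month of raw entries kept in Oracle.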
Other Tools
Syslog event recording
- Gathers all logs from network devices
- Stored in Oracle DB
- Accessible from CSDB
- Filtering and propagation by notification
LHCOPN: perfSONAR tool
- Decentralized network tool
- Regular OWD (one-way delay), latency and throughput tests
- Other tools such as traceroute
- LHCOPN network analysis
Implementation ongoing: testing phase with a 1 Gb link; security tests not yet complete. (www.perfsonar.net)
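The Syslog filtering and notification step listed above could look roughly like the sketch below. It relies only on the standard syslog priority header, where PRI = facility * 8 + severity; the function names, the severity threshold and the sample messages are assumptions for illustration, not the filters actually in use.

```python
import re

PRI_PATTERN = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)


def parse_pri(line):
    """Split a syslog line into (facility, severity, rest) using the <PRI> header.

    Returns None when the line has no valid PRI header.
    """
    match = PRI_PATTERN.match(line)
    if match is None:
        return None
    pri = int(match.group(1))
    if pri > 191:  # highest valid PRI: facility 23, severity 7
        return None
    return pri // 8, pri % 8, match.group(2)


def should_notify(line, max_severity=3):
    """True when the severity is at or below max_severity (0 = emergency, 3 = error)."""
    parsed = parse_pri(line)
    return parsed is not None and parsed[1] <= max_severity


# Hypothetical messages: <187> is local7.error, <190> is local7.info.
print(should_notify("<187>%LINK-3-UPDOWN: Interface Gi0/1, changed state to down"))  # True
print(should_notify("<190>%SYS-6-LOGGINGHOST: logging started"))  # False
```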
Data storage
Data Storage
Summary:
• Spectrum proprietary DBs for core and alarms
• MySQL database for events and Service Manager
• Oracle database for stats (CSR) and alarm history (LANDB)
• Oracle database for Syslog info
• Standalone MySQL database for perfSONAR tools
=> Too many different types of storage.
=> No correlation between Syslog and SNMP.
Requirements
NMS Status
• Advantages:
- Efficient root cause analysis
- Correct event/alarm management
- High availability
- Really good topology views (useful for intervention group)
- Supports NICE users
- Very good level of filtering (topology, alarms)
- Notification support
• Negative points / weaknesses:
- Expensive
- Polling limit is almost reached (a new version with a completely redesigned polling system will arrive in 2 years)
- Not a performance system: cannot handle 50K statistics
- Integrating devices from non-certified manufacturers is complex
- Data collection mostly limited to SNMP (changes ongoing)
Requirements
Mandatory:
- Root cause analysis
- High-frequency polling: 1-2 min for critical nodes, 3-5 min for others
- Network topology representation
- Notifications (SMS / MAIL / XMPP) and general console
- Distributed environment
- High-availability system
- Complete performance management
- IPv6 support
Nice to have:
- Autodiscovery system
- Mobile version
- Oracle centralized database
Numbers and storage time:
- Polling capacity for at least 5K nodes
- Performance statistics for 56K ports
- Data lifetime: 1 month without aggregation, maximum with aggregation
- Device alarms: around 2 years
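To give a feel for the polling numbers above, the following back-of-the-envelope sketch estimates the sustained polling rate for a two-tier schedule. The 5K nodes and the polling intervals come from the requirements; the share of critical nodes (10%) is an assumption made purely for illustration.

```python
def polls_per_second(nodes, critical_fraction, critical_interval_s, other_interval_s):
    """Average number of node polls per second for a two-tier polling schedule."""
    critical_nodes = nodes * critical_fraction
    other_nodes = nodes - critical_nodes
    return critical_nodes / critical_interval_s + other_nodes / other_interval_s


# 5K nodes from the requirements; the 10% critical share is an assumed split.
fastest = polls_per_second(5000, 0.10, critical_interval_s=60, other_interval_s=180)
slowest = polls_per_second(5000, 0.10, critical_interval_s=120, other_interval_s=300)
print(f"{slowest:.1f} to {fastest:.1f} node polls per second")
```

Each node poll typically fetches many attributes, so the number of individual values collected per second is considerably higher than the node poll rate.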
Research
Tools that fit best:
• Icinga: Nagios-like (fork) (Not Yet Tested, NYT)
• Zabbix: large polling scale, open source, notification, Oracle database, distributed (NYT) (http://www.zabbix.com/features.php)
• SolarWinds: commercial, but includes performance management and is less expensive (NYT)
• OpenNMS: (http://www.opennms.org/about/)
- Open source, completely customizable
- High-frequency polling with distributed environment
- Event correlation, alarm management, notification
- Many data collection methods supported (SNMP, HTTP, JMX, JDBC, NAGIOS-NSCLIENT)
Links:
• http://en.wikipedia.org/wiki/Comparison_of_network_monitoring_systems
• http://www.slac.stanford.edu/xorg/nmtf/nmtf-tools.html
Thanks. Questions?