Nagios
Deprecated
This page is deprecated and does not reflect the current state of SOWN.
Nagios is no longer used by SOWN. Instead we use a fork of the codebase called Icinga.
Nagios | |
---|---|
Was installed on | Sown-monitor |
Website | http://www.nagios.org/ |
Nagios® is an open source host, service and network monitoring program. SOWN uses it to monitor our services and to check node availability. If a node or server goes down, Nagios informs the technical team via IRC and sends an email to the appropriate admin.
Configuration
Nagios is configured from the main database. An hourly cron job regenerates the configuration.
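As an illustration of database-driven configuration, the sketch below renders Nagios host definitions from a list of node records. It is not SOWN's actual script: the record fields, the template names and the example hostnames and addresses are all hypothetical, and the list stands in for a query against the main database.

```python
def render_host(node):
    """Render one Nagios host object definition from a node record."""
    lines = [
        "define host {",
        f"    use        {node['template']}",
        f"    host_name  {node['hostname']}",
        f"    address    {node['address']}",
        "}",
    ]
    return "\n".join(lines)


def render_config(nodes):
    """Join the host definitions into the text of one configuration file."""
    return "\n\n".join(render_host(node) for node in nodes) + "\n"


# Hypothetical rows, standing in for the result of a database query.
nodes = [
    {"hostname": "node-example-1", "address": "192.0.2.10", "template": "home-node"},
    {"hostname": "node-example-2", "address": "192.0.2.11", "template": "campus-node"},
]

print(render_config(nodes))
```

A cron job could write this output to a file in the Nagios configuration directory and then reload Nagios, although the page does not describe how the real job does this.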
This automatic configuration applies to all campus and home nodes.
Host checks
Host status is determined by a ping check that sends 5 packets. A node is considered to be down if it has an RTA > 5000ms or a packet loss of 100%.
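The host rule can be written as a small predicate. This is a minimal sketch of the thresholds stated above, not the plugin Nagios actually runs. The function name is hypothetical, and it assumes the RTA may be missing (None) when every packet is lost.

```python
from typing import Optional

HOST_DOWN_RTA_MS = 5000   # round-trip average above this means "down"
HOST_DOWN_LOSS_PCT = 100  # total packet loss means "down"


def host_is_down(rta_ms: Optional[float], packet_loss_pct: float) -> bool:
    """Apply the host check thresholds to one ping result."""
    if packet_loss_pct >= HOST_DOWN_LOSS_PCT:
        return True
    # Assumption: with no replies there is no RTA, so None never counts as slow.
    return rta_ms is not None and rta_ms > HOST_DOWN_RTA_MS
```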
Service checks
Service checks are carried out differently depending on the service and the host.
Ping
Home Nodes
The ping service is considered to be in a warning state if RTA > 500ms or packet loss > 20%. A critical state is entered if RTA > 1500ms or packet loss > 60%.
Other Hosts
The ping service is considered to be in a warning state if RTA > 100ms or packet loss > 20%. A critical state is entered if RTA > 500ms or packet loss > 60%.
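The two sets of ping thresholds differ only in their numbers, so they can be expressed as one lookup table. The sketch below assumes the host-type keys "home" and "other", which are hypothetical labels for the two groups above.

```python
# (RTA limit in ms, packet loss limit in percent) per state and host type.
PING_THRESHOLDS = {
    "home": {"warning": (500, 20), "critical": (1500, 60)},
    "other": {"warning": (100, 20), "critical": (500, 60)},
}


def ping_state(host_type, rta_ms, packet_loss_pct):
    """Return OK, WARNING or CRITICAL for one ping service result."""
    limits = PING_THRESHOLDS[host_type]
    # Test the more severe state first so it takes precedence.
    for state in ("critical", "warning"):
        rta_limit, loss_limit = limits[state]
        if rta_ms > rta_limit or packet_loss_pct > loss_limit:
            return state.upper()
    return "OK"
```

The same 600ms round trip is therefore only a warning on a home node but critical on any other host.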
SSH
The SSH service checks whether the host's SSH daemon is accepting connections.
DNS
The DNS service checks that the SOWN DNS server contains a correct record for the hostname of the host.
NRPE
NRPE is a standalone daemon that allows other checks, such as Disk, Load, Users, Procs and Zombie, to be carried out. Each server has its own configuration file, which the daemon parses when it starts, so that the right things are checked.
Backup
The Backup service checks that a backup of the server's backed-up files has been made in the past 26 hours (to allow time for the daily backup to take place). The status of the backup is communicated passively using NSCA.
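The 26-hour window gives a daily backup two hours of slack before the check fails. A minimal sketch of that freshness test follows. The function names are hypothetical, and it assumes a stale backup is reported as critical, which the page does not state. The return codes follow the standard Nagios plugin convention (0 for OK, 2 for CRITICAL).

```python
from datetime import datetime, timedelta

MAX_BACKUP_AGE = timedelta(hours=26)


def backup_is_fresh(last_backup: datetime, now: datetime) -> bool:
    """True if the most recent backup finished within the last 26 hours."""
    return now - last_backup <= MAX_BACKUP_AGE


def backup_return_code(last_backup: datetime, now: datetime) -> int:
    """Map freshness to a plugin return code (assumed: stale means CRITICAL)."""
    return 0 if backup_is_fresh(last_backup, now) else 2
```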
OfflineBackup
The OfflineBackup service checks that a manual backup has been taken. The status of the offline backup is communicated passively using NSCA.
MySQLBackup
The MySQLBackup service checks that the server's MySQL database has been backed up. The status of the MySQL backup is communicated passively using NSCA.
Disk
The Disk service checks the free space on the server's hard disk partitions. The service is considered to be in a warning state if the free percentage is < 20%, and a critical state is entered if the free percentage is < 10%. This check is performed using NRPE.
Load
The Load service checks the load of the server's CPU. The service is considered to be in a warning state if the load averages exceed 15,10,5, and a critical state is entered if the load averages exceed 30,25,20. This check is performed using NRPE.
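The load thresholds are triples, compared position by position against the three load averages. The sketch below assumes the triples are in the usual 1-, 5- and 15-minute order and that exceeding any single limit is enough to change state. The page does not say whether one or all of the limits must be exceeded.

```python
WARNING_LOAD = (15.0, 10.0, 5.0)
CRITICAL_LOAD = (30.0, 25.0, 20.0)


def load_state(averages):
    """Classify a (1-min, 5-min, 15-min) load average triple."""
    # Assumption: any one average over its limit triggers the state.
    if any(avg > limit for avg, limit in zip(averages, CRITICAL_LOAD)):
        return "CRITICAL"
    if any(avg > limit for avg, limit in zip(averages, WARNING_LOAD)):
        return "WARNING"
    return "OK"
```

On a Unix host, `os.getloadavg()` returns a triple in this form, so `load_state(os.getloadavg())` would classify the local machine.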
Users
The Users service checks the number of users logged in to the server. The service is considered to be in a warning state if the number of users > 5, and a critical state is entered if the number of users > 10. This check is performed using NRPE.
Procs
The Procs service checks the number of processes running on the server. The service is considered to be in a warning state if the number of processes > 150, and a critical state is entered if the number of processes > 200. This check is performed using NRPE.
Zombie
The Zombie service checks the number of zombie processes running on the server. The service is considered to be in a warning state if the number of zombie processes > 5, and a critical state is entered if the number of zombie processes > 10. This check is performed using NRPE.
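The Disk, Users, Procs and Zombie checks all compare a single number against a warning limit and a critical limit. Disk is the odd one out because a lower value is worse. A table-driven sketch of these four rules follows. The structure and function name are illustrative only and are not how the NRPE plugins are implemented.

```python
import operator

# service: (comparison that signals a breach, warning limit, critical limit)
THRESHOLDS = {
    "Disk": (operator.lt, 20, 10),      # percent free: lower is worse
    "Users": (operator.gt, 5, 10),      # logged-in users
    "Procs": (operator.gt, 150, 200),   # running processes
    "Zombie": (operator.gt, 5, 10),     # zombie processes
}


def service_state(service, value):
    """Return OK, WARNING or CRITICAL for a measured value."""
    breached, warning_limit, critical_limit = THRESHOLDS[service]
    if breached(value, critical_limit):
        return "CRITICAL"
    if breached(value, warning_limit):
        return "WARNING"
    return "OK"
```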
Temp
The Temp service uses our own check_temp plugin to check the temperature of any local hard disks, using the hddtemp package.
Notification
Nagios is configured to send a number of different types of notification. Email notifications are discussed further below. In addition, all notifications are made to the #sown IRC channel via the sown-bot.
The system is configured for notifications as follows:
Core Servers
Host and Service notifications are sent to the 'hostmaster' contact immediately.
Dev Servers
Host and Service notifications are sent to the 'hostmaster' contact immediately.
Campus Nodes
Host and Service notifications are sent to the 'nodemaster' contact immediately.
Home Nodes
Host notifications are sent to the appropriate nodeadmin after 2 hours, and then repeated at 48-hour intervals. Service notifications are not sent to the nodeadmin.
No home node notifications are generated or sent while SOWN has no connection to the Internet.
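The email rules above can be summarised as a routing function plus a timing test for home nodes. This is a sketch of the stated policy only. The group keys, function names and parameters are hypothetical, and the real behaviour would have been expressed in Nagios contact and notification settings rather than in code like this.

```python
from datetime import timedelta

IMMEDIATE_CONTACTS = {"core": "hostmaster", "dev": "hostmaster", "campus": "nodemaster"}
HOME_FIRST_DELAY = timedelta(hours=2)
HOME_REPEAT_INTERVAL = timedelta(hours=48)


def email_contact(group, kind, nodeadmin=None):
    """Return who is emailed for a 'host' or 'service' problem, or None."""
    if group in IMMEDIATE_CONTACTS:
        return IMMEDIATE_CONTACTS[group]
    if group == "home":
        # Service notifications are never sent to the nodeadmin.
        return nodeadmin if kind == "host" else None
    raise ValueError(f"unknown group: {group}")


def home_node_should_notify(down_for, since_last_notification, internet_up):
    """Decide whether a home node host notification is due now."""
    if not internet_up:
        return False  # suppressed while SOWN itself is offline
    if since_last_notification is None:
        return down_for >= HOME_FIRST_DELAY
    return since_last_notification >= HOME_REPEAT_INTERVAL
```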
Snapshot
In addition, the state of the network is regularly communicated to SOWN Support via the Nagios Snapshot.