| Name | Nagios |
|---|---|
| Deployed on | sown-vpn, sown-dev |
| Website | http://www.nagios.org/ |
Nagios® is an open-source host, service and network monitoring program. SOWN uses it to monitor our services and to check node availability. If a node or server goes down, Nagios informs the technical team via IRC and sends an email to the appropriate admin.
Nagios is configured from the main database. The configuration is regenerated by an hourly cron job.
This automatic configuration applies to all campus and home nodes.
Host status is determined by a ping check. A node is considered to be down if it has an RTA > 5000ms or a packet loss of 100%, with 5 packets sent for each check.
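The host rule above can be sketched in a few lines of Python. This is only an illustration of the thresholds, not the plugin Nagios actually runs; the function names and the use of `None` to mark a lost packet are assumptions made for the example.

```python
from statistics import mean
from typing import Optional, Sequence, Tuple

# Thresholds taken from the text above.
DOWN_RTA_MS = 5000.0
DOWN_LOSS_PCT = 100.0


def summarise(replies: Sequence[Optional[float]]) -> Tuple[Optional[float], float]:
    """Return (round-trip average in ms, packet loss in %) for one check.

    Each entry is a reply time in milliseconds, or None for a lost packet
    (a convention assumed for this sketch).
    """
    if not replies:
        raise ValueError("at least one packet must be sent")
    received = [r for r in replies if r is not None]
    loss_pct = 100.0 * (len(replies) - len(received)) / len(replies)
    rta_ms = mean(received) if received else None
    return rta_ms, loss_pct


def host_is_down(replies: Sequence[Optional[float]]) -> bool:
    """Apply the host rule: down if RTA > 5000 ms or packet loss is 100%."""
    rta_ms, loss_pct = summarise(replies)
    if loss_pct >= DOWN_LOSS_PCT:
        return True
    return rta_ms is not None and rta_ms > DOWN_RTA_MS


# Five packets per check, as described above.
print(host_is_down([12.0, 15.5, None, 11.2, 13.1]))  # one packet lost, fast replies
print(host_is_down([None] * 5))                      # every packet lost
```

A single lost packet does not mark the host as down under this rule; only total loss or a very slow average does.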
Service checks are carried out differently depending on the service and the host.
The ping service is considered to be in a warning state if RTA > 500ms or packet loss > 20%. A critical state is entered if RTA > 1500ms or packet loss > 60%.
The ping service is considered to be in a warning state if RTA > 100ms or packet loss > 20%. A critical state is entered if RTA > 500ms or packet loss > 60%.
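As a rough illustration, the two sets of ping thresholds quoted above can be expressed as data and evaluated by one function. The names `RELAXED` and `STRICT` are placeholders invented for this sketch, since the text does not say which class of host each set applies to. The numeric states follow the usual Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
from dataclasses import dataclass

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin return codes


@dataclass(frozen=True)
class PingThresholds:
    warn_rta_ms: float
    warn_loss_pct: float
    crit_rta_ms: float
    crit_loss_pct: float


# The two threshold sets from the text. The labels are hypothetical.
RELAXED = PingThresholds(warn_rta_ms=500, warn_loss_pct=20,
                         crit_rta_ms=1500, crit_loss_pct=60)
STRICT = PingThresholds(warn_rta_ms=100, warn_loss_pct=20,
                        crit_rta_ms=500, crit_loss_pct=60)


def ping_state(rta_ms: float, loss_pct: float, t: PingThresholds) -> int:
    """Check the critical thresholds first, then the warning thresholds."""
    if rta_ms > t.crit_rta_ms or loss_pct > t.crit_loss_pct:
        return CRITICAL
    if rta_ms > t.warn_rta_ms or loss_pct > t.warn_loss_pct:
        return WARNING
    return OK


# The same measurement can land in different states under the two sets.
print(ping_state(300, 0, RELAXED), ping_state(300, 0, STRICT))
```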
The SSH service checks whether the SSH service is accepting connections.
The DNS service checks that the SOWN DNS server contains a correct record for the hostname of the host.
NRPE is a standalone daemon that allows other checks, such as Disk, Load, Users, Procs and Zombie, to be carried out. Each server has its own configuration file, which is parsed when the daemon starts so that the right things are checked.
The Backup service checks that a backup of the server's backed-up files has been made in the past 26 hours (to allow time for the daily backup to take place). The status of the backup is communicated passively using NSCA.
The OfflineBackup service checks that a manual backup has been taken. The status of the offline backup is communicated passively using NSCA.
The MySQLBackup service checks that the server's MySQL database has been backed up. The status of the MySQL backup is communicated passively using NSCA.
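The passive backup checks can be pictured as two steps: decide whether the last backup is recent enough, then format a result for submission. The sketch below assumes that a stale backup is reported as CRITICAL (the text does not say which state is used) and uses a made-up host name, `example-server`. The output line follows the tab-separated format (host, service, return code, plugin output) that the `send_nsca` client reads for service results.

```python
from datetime import datetime, timedelta

OK, CRITICAL = 0, 2
MAX_BACKUP_AGE = timedelta(hours=26)  # window quoted in the text above


def backup_state(last_backup: datetime, now: datetime) -> int:
    """OK if the last backup is within 26 hours, otherwise CRITICAL (assumed)."""
    return OK if now - last_backup <= MAX_BACKUP_AGE else CRITICAL


def passive_result_line(host: str, service: str, state: int, output: str) -> str:
    """Format one tab-separated passive service result."""
    return f"{host}\t{service}\t{state}\t{output}\n"


now = datetime(2024, 1, 2, 12, 0)
state = backup_state(datetime(2024, 1, 1, 11, 0), now)  # 25 hours old
print(passive_result_line("example-server", "Backup", state, "Backup OK"), end="")
```

The 26-hour window leaves two hours of slack over a daily schedule, so a backup that merely starts late does not raise an alert.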
The Disk service checks the free space on the server's hard disk partitions. The service is considered to be in a warning state if the free percentage is < 20%, and a critical state is entered if the free percentage is < 10%. This check is performed using NRPE.
The Load service checks the load of the server's CPU. The service is considered to be in a warning state if the 1, 5 and 15 minute load averages exceed 15, 10 and 5 respectively, and a critical state is entered if they exceed 30, 25 and 20. This check is performed using NRPE.
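The load thresholds come as triples, one value per load average (1, 5 and 15 minutes). A minimal sketch of the comparison follows, assuming that exceeding any one of the three thresholds is enough to change state, which is how the standard `check_load` plugin is generally understood to behave. The function name is invented for the example.

```python
from typing import Sequence

OK, WARNING, CRITICAL = 0, 1, 2

# Thresholds from the text above, in 1, 5 and 15 minute order.
WARN = (15.0, 10.0, 5.0)
CRIT = (30.0, 25.0, 20.0)


def load_state(averages: Sequence[float]) -> int:
    """Compare each load average with its own threshold.

    Assumption: one average over its threshold is sufficient.
    """
    if any(avg > limit for avg, limit in zip(averages, CRIT)):
        return CRITICAL
    if any(avg > limit for avg, limit in zip(averages, WARN)):
        return WARNING
    return OK


print(load_state((2.0, 3.0, 6.0)))  # only the 15 minute average is over its warning limit
```

The decreasing thresholds mean that a short spike is tolerated, while a sustained load over 15 minutes raises a warning at a much lower level.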
The Users service checks the number of users logged in to the server. The service is considered to be in a warning state if the number of users is > 5, and a critical state is entered if the number of users is > 10. This check is performed using NRPE.
The Procs service checks the number of processes running on the server. The service is considered to be in a warning state if the number of processes is > 150, and a critical state is entered if the number of processes is > 200. This check is performed using NRPE.
The Zombie service checks the number of zombie processes running on the server. The service is considered to be in a warning state if the number of zombie processes is > 5, and a critical state is entered if the number of zombie processes is > 10. This check is performed using NRPE.
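The Disk, Users, Procs and Zombie checks all follow the same pattern: one measured value compared with a warning and a critical threshold. The sketch below collects the thresholds quoted above in one place. The helper names are made up for illustration, and the real checks are performed by plugins run through NRPE, not by this code. Disk is the one case where a lower value is worse.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def state_when_above(value: float, warn: float, crit: float) -> int:
    """For checks where larger values are worse (Users, Procs, Zombie)."""
    if value > crit:
        return CRITICAL
    if value > warn:
        return WARNING
    return OK


def state_when_below(value: float, warn: float, crit: float) -> int:
    """For checks where smaller values are worse (Disk free percentage)."""
    if value < crit:
        return CRITICAL
    if value < warn:
        return WARNING
    return OK


# (warning, critical) pairs from the text above.
COUNT_THRESHOLDS = {"Users": (5, 10), "Procs": (150, 200), "Zombie": (5, 10)}
DISK_FREE_THRESHOLDS = (20.0, 10.0)


def count_state(service: str, count: int) -> int:
    warn, crit = COUNT_THRESHOLDS[service]
    return state_when_above(count, warn, crit)


def disk_state(free_pct: float) -> int:
    return state_when_below(free_pct, *DISK_FREE_THRESHOLDS)


print(count_state("Procs", 170), disk_state(9.5))
```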
Nagios is configured to send a number of different types of notification. Email notifications are discussed further below. In addition, all notifications are made to the #sown IRC channel via the sown-bot.
The system is configured for notifications as follows:
Host and service notifications are sent to the 'hostmaster' contact immediately, and repeat at an hourly interval.
Host and service notifications are sent to the 'hostmaster' contact immediately, and repeat at an hourly interval.
Host and service notifications are sent to the 'nodemaster' contact immediately, and repeat at an hourly interval.
Host notifications are sent to the 'nodemaster' contact and to the appropriate nodeadmin after 2 hours, and repeat at a 48 hour interval. Service notifications are sent to the 'nodemaster' contact after 30 minutes, and repeat at an hourly interval.
No home node notifications are generated or sent if SOWN does not have a connection to the Internet.
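The notification timings above amount to an initial delay followed by a repeat interval. The following sketch lists when notifications would go out during an outage, under the simplifying assumption that the problem is detected at minute 0 and stays unresolved. In a Nagios object definition these two values would typically correspond to the `first_notification_delay` and `notification_interval` directives. The function name and the `suppressed` flag, which models the missing Internet connection, are inventions for this example.

```python
from typing import List


def notification_minutes(outage_minutes: int, first_delay: int, interval: int,
                         suppressed: bool = False) -> List[int]:
    """Minutes after the problem began at which a notification is sent.

    suppressed models the rule that nothing is sent when SOWN itself
    has no Internet connection (hypothetical flag for this sketch).
    """
    if suppressed:
        return []
    return list(range(first_delay, outage_minutes + 1, interval))


# Immediate notification with an hourly repeat, over a 3 hour outage.
print(notification_minutes(180, first_delay=0, interval=60))

# Host notification after 2 hours with a 48 hour repeat, over a 3 day outage.
print(notification_minutes(3 * 24 * 60, first_delay=120, interval=48 * 60))
```

The long delay and interval in the second case keep the volume of email low for nodes that are often switched off for long periods.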
| Has contributor | Crwilliams |
| Has url | http://www.nagios.org/ |
| Installed on | Sown-vpn and Sown-dev |