Update Needed
This page needs to be updated
Needs major updating following various recent changes
| Name | Nagios |
|---|---|
| Deployed on | |
| Website | http://www.nagios.org/ |
Nagios® is an open source host, service and network monitoring program. SOWN uses it to monitor our services and to check node availability. If a node or server goes down, Nagios informs the technical team via IRC and sends an email to the appropriate admin.
apt-get install nagios-nrpe-server
apt-get install nagios-plugins
apt-get install nagios-nrpe-plugins
Then install the config file and ensure that all of the paths in it are correct. Note that the check_temp command needs our own check_temp plugin, and that it requires root access in order to run hddtemp.
Nagios is configured from the main database. The configuration is regenerated by an hourly cron job.
This automatic configuration applies to all campus and home nodes.
Host status is determined through the use of a ping check. A node is considered to be down if it has an RTA > 5000ms or a packet loss of 100%, with 5 packets sent for each check.
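As a rough illustration, the host check above could correspond to a check_ping invocation such as the one built by this sketch. The plugin path, the example address and the function name are assumptions, not taken from the SOWN configuration. check_ping expects both a warning and a critical threshold pair; since only the "down" limits are stated here, the sketch sets the warning pair equal to the critical pair.

```python
# Hypothetical plugin location; adjust to wherever the plugins are installed.
CHECK_PING = "/usr/lib/nagios/plugins/check_ping"


def host_check_argv(address, crit_rta_ms=5000.0, crit_loss_pct=100, packets=5):
    """Build a check_ping command line for the host check.

    Only the 'down' limits (RTA 5000ms, 100% loss, 5 packets) come from the
    text. Reusing them for -w is an assumption made for this sketch.
    """
    threshold = f"{crit_rta_ms:.1f},{crit_loss_pct}%"
    return [
        CHECK_PING,
        "-H", address,
        "-w", threshold,
        "-c", threshold,
        "-p", str(packets),
    ]


if __name__ == "__main__":
    # Print the command line for a made-up node address.
    print(" ".join(host_check_argv("node.example.org")))
```

The returned list can be passed to `subprocess.run` on a machine that has the plugin installed.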
Service checks are carried out differently depending on the service and the host.
The ping service is considered to be in a warning state if RTA > 500ms or packet loss > 20%. A critical state is entered if RTA > 1500ms or packet loss > 60%.
The ping service is considered to be in a warning state if RTA > 100ms or packet loss > 20%. A critical state is entered if RTA > 500ms or packet loss > 60%.
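The two sets of ping thresholds above can be compared with a small sketch of the classification logic. This is not the check_ping plugin itself, and the names `RELAXED` and `STRICT` are labels invented here for the first and second set of thresholds. The comparisons use a strict "greater than", as in the text.

```python
from dataclasses import dataclass

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2


@dataclass(frozen=True)
class PingThresholds:
    warn_rta_ms: float
    warn_loss_pct: float
    crit_rta_ms: float
    crit_loss_pct: float


# Hypothetical labels for the two threshold sets described in the text.
RELAXED = PingThresholds(warn_rta_ms=500, warn_loss_pct=20,
                         crit_rta_ms=1500, crit_loss_pct=60)
STRICT = PingThresholds(warn_rta_ms=100, warn_loss_pct=20,
                        crit_rta_ms=500, crit_loss_pct=60)


def ping_state(rta_ms, loss_pct, thresholds):
    """Classify one ping result, checking the critical limits first."""
    if rta_ms > thresholds.crit_rta_ms or loss_pct > thresholds.crit_loss_pct:
        return CRITICAL
    if rta_ms > thresholds.warn_rta_ms or loss_pct > thresholds.warn_loss_pct:
        return WARNING
    return OK
```

For example, a 300ms round trip with no packet loss is OK under the first set of thresholds but a warning under the second.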
The SSH service checks whether the SSH service is accepting connections.
The DNS service checks that the SOWN DNS server contains a correct record for the hostname of the host.
NRPE is a standalone daemon that allows other checks, such as Disk, Load, Users, Procs and Zombie, to be carried out. Each server has its own configuration file, which is parsed when the daemon starts so that the right things are checked.
The Backup service checks that a backup of the server's backed up files has been made in the past 26 hours (to allow time for the daily backup to take place). The status of the backup is communicated passively using NSCA.
The OfflineBackup service checks that a manual backup has been taken. The status of the offline backup is communicated passively using NSCA.
The MySQLBackup service checks that the server's MySQL database has been backed up. The status of the MySQL backup is communicated passively using NSCA.
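The 26 hour freshness rule for the Backup service can be sketched as follows. The function formats its result as one tab-separated line (host, service description, return code, plugin output), which is the form in which passive service results are normally handed to the send_nsca client. The host name, the message text and the choice of CRITICAL for a stale backup are assumptions made for this sketch.

```python
from datetime import datetime, timedelta

# Standard Nagios return codes used in this sketch.
OK, CRITICAL = 0, 2

# A backup older than this is treated as missing (26 hours, from the text).
MAX_BACKUP_AGE = timedelta(hours=26)


def backup_result_line(host, last_backup, now, service="Backup"):
    """Return a passive check result line for the given backup timestamp."""
    age = now - last_backup
    if age <= MAX_BACKUP_AGE:
        code, label = OK, "OK"
    else:
        # Assumption: a stale backup is reported as CRITICAL.
        code, label = CRITICAL, "CRITICAL"
    hours = age.total_seconds() / 3600
    output = f"BACKUP {label} - last backup {hours:.1f} hours ago"
    return f"{host}\t{service}\t{code}\t{output}\n"
```

A backup taken 24 hours ago yields a line with return code 0, while one taken 27 hours ago yields return code 2.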
The Disk service checks the free space on the server's hard disk partitions. The service is considered to be in a warning state if the free percentage is < 20%, and a critical state is entered if the free percentage is < 10%. This check is performed using nrpe.
The Load service checks the load of the server's CPU. The service is considered to be in a warning state if the load averages exceed 15,10,5, and a critical state is entered if the load averages exceed 30,25,20. This check is performed using nrpe.
The Users service checks the number of users logged in to the server. The service is considered to be in a warning state if the number of users > 5, and a critical state is entered if the number of users > 10. This check is performed using nrpe.
The Procs service checks the number of processes running on the server. The service is considered to be in a warning state if the number of processes > 150, and a critical state is entered if the number of processes > 200. This check is performed using nrpe.
The Zombie service checks the number of zombie processes running on the server. The service is considered to be in a warning state if the number of zombie processes > 5, and a critical state is entered if the number of zombie processes > 10. This check is performed using nrpe.
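The thresholds for the five NRPE checks above map naturally onto `command[...]` definitions in an NRPE configuration file. The sketch below renders such definitions using the standard check_disk, check_load, check_users and check_procs plugins. The command names and the plugin directory are assumptions, and partition selection for the disk check is left out.

```python
# Assumed plugin directory; adjust to the local installation.
PLUGIN_DIR = "/usr/lib/nagios/plugins"

# Hypothetical command names mapped to plugin invocations that use the
# thresholds quoted in the text.
NRPE_CHECKS = {
    "check_disk": "check_disk -w 20% -c 10%",
    "check_load": "check_load -w 15,10,5 -c 30,25,20",
    "check_users": "check_users -w 5 -c 10",
    "check_procs": "check_procs -w 150 -c 200",
    # -s Z restricts the process count to zombie processes.
    "check_zombie": "check_procs -w 5 -c 10 -s Z",
}


def render_nrpe_commands(checks=NRPE_CHECKS, plugin_dir=PLUGIN_DIR):
    """Render NRPE command definitions, one per line."""
    return "\n".join(
        f"command[{name}]={plugin_dir}/{command}"
        for name, command in checks.items()
    )


if __name__ == "__main__":
    print(render_nrpe_commands())
```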
The temperature check uses our own check_temp plugin to read the temperature of any local hard disks via the hddtemp package.
Nagios is configured to send a number of different types of notification. Email notifications are discussed further below. In addition, all notifications are made to the #sown IRC channel via the sown-bot.
The system is configured for notifications as follows:
Host and Service notifications sent to 'hostmaster' contact immediately.
Host and Service notifications sent to 'nodemaster' contact immediately.
Host notifications sent to the appropriate nodeadmin after 2 hours, with 48 hour interval. Service notifications are not sent to the nodeadmin.
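The nodeadmin timing above (first notification after 2 hours, then one every 48 hours) can be illustrated with a simplified schedule calculation. This is only a sketch of the arithmetic: real notification times also depend on check intervals and notification periods, and the function name is invented here.

```python
from datetime import datetime, timedelta

FIRST_DELAY = timedelta(hours=2)       # delay before the first nodeadmin email
REPEAT_INTERVAL = timedelta(hours=48)  # interval between later emails


def nodeadmin_notification_times(down_since, until):
    """List the times at which a nodeadmin would be notified.

    Assumes the host stays down for the whole period from down_since to until.
    """
    times = []
    t = down_since + FIRST_DELAY
    while t <= until:
        times.append(t)
        t += REPEAT_INTERVAL
    return times
```

For a node that stays down for 100 hours, this gives notifications 2, 50 and 98 hours after the outage began.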
No home node notifications are generated or sent while SOWN itself has no connection to the Internet.
In addition, the state of the network is regularly communicated to SOWN Support via the Nagios Snapshot.