Nagios

Update Needed
This page needs to be updated

Needs major updating following various recent changes

Name Nagios
Deployed on
Website http://www.nagios.org/

Nagios® is an open source host, service and network monitoring program. SOWN uses it to monitor our services and to check node availability. If a node or server goes down, Nagios informs the technical team via IRC and sends an email to the appropriate admin.

Client Installation on Debian

apt-get install nagios-nrpe-server

apt-get install nagios-plugins

apt-get install nagios-nrpe-plugin

Then install the config file and make sure that all of the paths in it are correct. The check_temp command needs our own check_temp plugin, and also requires root access in order to run hddtemp.

Configuration

Nagios is configured from the main database. The configuration is regenerated by an hourly cron job.

This automatic configuration applies to all campus and home nodes.
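As a rough illustration of what such a cron job might do, the sketch below renders Nagios host definitions from database rows. The row fields (hostname, address, template) and the template name home-node are hypothetical; the real schema and generator script are not described on this page.

```python
from typing import Iterable, Mapping


def render_host(node: Mapping[str, str]) -> str:
    """Render one Nagios host object definition from a database row.

    The keys 'template', 'hostname' and 'address' are assumed field names.
    """
    return (
        "define host {\n"
        f"    use        {node['template']}\n"
        f"    host_name  {node['hostname']}\n"
        f"    address    {node['address']}\n"
        "}\n"
    )


def render_config(nodes: Iterable[Mapping[str, str]]) -> str:
    """Join the definitions for all nodes into one config file body."""
    return "\n".join(render_host(node) for node in nodes)


# Example row; the values are placeholders, not real SOWN nodes.
rows = [
    {"hostname": "node-example", "address": "192.0.2.10", "template": "home-node"},
]
print(render_config(rows))
```

A real job would write this output to a file under the Nagios configuration directory and then reload Nagios.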

Host checks

Host status is determined through the use of a ping check. A node is considered to be down if it has an RTA > 5000ms or a packet loss of 100%, with 5 packets sent for each check.
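The rule above can be written out as a small sketch. The function names are illustrative only and this is not the plugin's own code; it simply restates the thresholds given on this page.

```python
PACKETS_PER_CHECK = 5  # packets sent for each host check


def packet_loss_pct(replies_received: int, sent: int = PACKETS_PER_CHECK) -> float:
    """Percentage of packets that got no reply."""
    return 100.0 * (sent - replies_received) / sent


def host_is_down(rta_ms: float, loss_pct: float) -> bool:
    """A host is down if RTA exceeds 5000 ms or every packet is lost."""
    return rta_ms > 5000 or loss_pct >= 100


# One reply out of five is 80% loss, which still counts as up.
print(host_is_down(rta_ms=120.0, loss_pct=packet_loss_pct(1)))
```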

Service checks

Service checks are carried out differently depending on the service and the host.

Ping

Home Nodes

The ping service is considered to be in a warning state if RTA > 500ms or packet loss > 20%. A critical state is entered if RTA > 1500ms or packet loss > 60%.

Other Hosts

The ping service is considered to be in a warning state if RTA > 100ms or packet loss > 20%. A critical state is entered if RTA > 500ms or packet loss > 60%.
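The two sets of ping thresholds can be compared side by side in a short sketch. The class and function names are illustrative, and the strict "greater than" comparisons follow the wording on this page rather than the plugin's source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PingThresholds:
    warn_rta_ms: float
    warn_loss_pct: float
    crit_rta_ms: float
    crit_loss_pct: float


# Thresholds as listed above for each class of host.
HOME_NODE = PingThresholds(500, 20, 1500, 60)
OTHER_HOST = PingThresholds(100, 20, 500, 60)


def ping_state(rta_ms: float, loss_pct: float, t: PingThresholds) -> str:
    """Classify a ping result; critical is tested first as it is stricter."""
    if rta_ms > t.crit_rta_ms or loss_pct > t.crit_loss_pct:
        return "CRITICAL"
    if rta_ms > t.warn_rta_ms or loss_pct > t.warn_loss_pct:
        return "WARNING"
    return "OK"


# The same 300 ms round trip is fine for a home node but not for a server.
print(ping_state(300, 0, HOME_NODE), ping_state(300, 0, OTHER_HOST))
```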

SSH

The SSH service checks whether the SSH service is accepting connections.

DNS

The DNS service checks that the SOWN DNS server contains a correct record for the hostname of the host.

NRPE

NRPE is a standalone daemon that allows other checks, such as Disk, Load, Users, Procs and Zombie, to be carried out. Each server has its own configuration file, which the daemon parses when it starts so that the right things are checked.

Backup

The Backup service checks that a backup of the server's backed up files has been made in the past 26 hours (to allow time for the daily backup to take place). The status of the backup is communicated passively using NSCA.
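The 26 hour window amounts to a simple freshness test, sketched below. The function name is hypothetical, and how the real check finds the time of the last backup is not described here.

```python
from datetime import datetime, timedelta

# A daily backup plus two hours of slack for the job to finish.
MAX_BACKUP_AGE = timedelta(hours=26)


def backup_is_fresh(last_backup: datetime, now: datetime) -> bool:
    """True if the last backup finished within the allowed window."""
    return now - last_backup <= MAX_BACKUP_AGE


now = datetime(2024, 1, 2, 12, 0)
print(backup_is_fresh(now - timedelta(hours=25), now))
print(backup_is_fresh(now - timedelta(hours=27), now))
```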

OfflineBackup

The OfflineBackup service checks that a manual backup has been taken. The status of the offline backup is communicated passively using NSCA.

MySQLBackup

The MySQLBackup service checks that the server's MySQL database has been backed up. The status of the MySQL backup is communicated passively using NSCA.

Disk

The Disk service checks the free space on the server's hard disk partitions. The service is considered to be in a warning state if the free percentage is < 20%, and a critical state is entered if the free percentage is < 10%. This check is performed using nrpe.
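As a sketch of the disk thresholds, the snippet below classifies a free-space percentage. The function names are illustrative; the actual check is done by the plugin run via nrpe, not by this code.

```python
import shutil


def free_percent(path: str) -> float:
    """Free space on the partition holding 'path', as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def disk_state(free_pct: float) -> str:
    """Below 10% free is critical, below 20% free is a warning."""
    if free_pct < 10:
        return "CRITICAL"
    if free_pct < 20:
        return "WARNING"
    return "OK"


print(disk_state(15.0))
```

In use, disk_state(free_percent("/")) would classify the root partition.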

Load

The Load service checks the load of the server's CPU. The service is considered to be in a warning state if the 1, 5 and 15 minute load averages exceed 15, 10 and 5 respectively, and a critical state is entered if they exceed 30, 25 and 20. This check is performed using nrpe.
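The load thresholds are three numbers because each is compared against a different load average. The sketch below assumes the values are ordered as 1, 5 and 15 minute averages, and that any one average exceeding its threshold is enough to change state; the names are illustrative.

```python
from typing import Sequence

# Thresholds for the (1 min, 5 min, 15 min) load averages.
WARN_LOAD = (15.0, 10.0, 5.0)
CRIT_LOAD = (30.0, 25.0, 20.0)


def load_state(load_avgs: Sequence[float]) -> str:
    """Classify a (1 min, 5 min, 15 min) load average triple."""
    if any(load > limit for load, limit in zip(load_avgs, CRIT_LOAD)):
        return "CRITICAL"
    if any(load > limit for load, limit in zip(load_avgs, WARN_LOAD)):
        return "WARNING"
    return "OK"


# A sustained 15 minute load of 6 warns even though the 1 minute load is low.
print(load_state((1.0, 1.0, 6.0)))
```

On a Unix host, os.getloadavg() returns a triple in this order that could be passed straight in.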

Users

The Users service checks the number of users logged in to the server. The service is considered to be in a warning state if the number of users > 5, and a critical state is entered if the number of users > 10. This check is performed using nrpe.

Procs

The Procs service checks the number of processes running on the server. The service is considered to be in a warning state if the number of processes > 150, and a critical state is entered if the number of processes > 200. This check is performed using nrpe.

Zombie

The Zombie service checks the number of zombie processes running on the server. The service is considered to be in a warning state if the number of zombie processes > 5, and a critical state is entered if the number of zombie processes > 10. This check is performed using nrpe.
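The Users, Procs and Zombie checks all share the same shape: a count compared against a warning and a critical limit. The sketch below collects the limits listed above in one table; the names are illustrative and this is not how the plugins themselves are written.

```python
# (warning, critical) limits per service, as listed on this page.
COUNT_THRESHOLDS = {
    "Users": (5, 10),
    "Procs": (150, 200),
    "Zombie": (5, 10),
}


def count_state(service: str, count: int) -> str:
    """Classify a count for one of the services in COUNT_THRESHOLDS."""
    warn, crit = COUNT_THRESHOLDS[service]
    if count > crit:
        return "CRITICAL"
    if count > warn:
        return "WARNING"
    return "OK"


print(count_state("Procs", 180))
```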

Temp

Using our own check_temp plugin, this service checks the temperature of any local hard disks via the hddtemp package.

Notification

Nagios is configured to send a number of different types of notification. Email notifications are discussed further below. In addition, all notifications are made to the #sown IRC channel via the sown-bot.

The system is configured for notifications as follows:

Core Servers

Host and Service notifications sent to 'hostmaster' contact immediately.

Dev Servers

Host and Service notifications sent to 'hostmaster' contact immediately.

Campus Nodes

Host and Service notifications sent to 'nodemaster' contact immediately.

Home Nodes

Host notifications are sent to the appropriate nodeadmin after 2 hours, and are then repeated at 48 hour intervals. Service notifications are not sent to the nodeadmin.

No home node notifications are generated or sent while SOWN itself has no connection to the Internet.
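The home node timing rules can be summarised in a short sketch. It assumes the 48 hours is a repeat interval measured from the previous notification; the function and parameter names are hypothetical, and in practice Nagios applies these rules itself from its configuration.

```python
from datetime import datetime, timedelta
from typing import Optional

FIRST_DELAY = timedelta(hours=2)       # wait before the first notification
REPEAT_INTERVAL = timedelta(hours=48)  # gap between repeat notifications


def should_notify(
    down_since: datetime,
    last_notified: Optional[datetime],
    now: datetime,
    internet_up: bool,
) -> bool:
    """Decide whether to notify the nodeadmin about a down home node."""
    if not internet_up:
        # If SOWN itself is offline, every home node looks down.
        return False
    if now - down_since < FIRST_DELAY:
        return False
    if last_notified is None:
        return True
    return now - last_notified >= REPEAT_INTERVAL


start = datetime(2024, 1, 1, 0, 0)
print(should_notify(start, None, start + timedelta(hours=3), internet_up=True))
```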

Snapshot

In addition, the state of the network is regularly communicated to SOWN Support via the Nagios Snapshot.