The status of the SOWN network is continuously monitored by Icinga. Further details of the status of the servers and Wi-Fi nodes are available on this website. The locations of these nodes are displayed on the SOWN network map.
Servers are monitored to check that they are online and are described as either UP or DOWN. Servers are broken down into two main categories: Production and Development. This is indicated by their colouring on SOWN's network topology.
There are numerous services monitored across all servers. Generally these services should show as OK indicating the server is behaving as expected. There are some generic services monitored across most if not all servers:
- Whether all the directories (and the files they contain) listed in the backup configuration for this host have successfully been backed up the last time the backup process ran. This shows CRITICAL if any file failed to be backed up. As this is a passive check it also shows CRITICAL if a check has not been submitted within 26 hours of the previous check.
- Whether the Cron jobs recorded on the SOWN admin system for this host match those reported by this host. This reports WARNING when the sets of Cron jobs do not match. As it is a passive check it also shows CRITICAL if a check has not been submitted within 125 minutes of the previous check.
- Checks the state of the server via systemd. It reports CRITICAL, along with a list of failed services, if the system state is "degraded". If the state is something other than degraded or running, it reports WARNING.
- For physical servers only, checks if the server's IPMI IP responds to pings. Note that many of our older servers don't respond to normal ICMP ping, so this instead uses RMCP ping. Returns CRITICAL if the server doesn't respond, based on rmcpping's exit code.
- Whether various binaries on the host have been modified from the standard packaged version. If so, this is an indication that the server may have been hacked and is being used for malicious purposes (e.g. harvesting passwords). This check reports CRITICAL if any of the binaries differ from what is expected. For less powerful servers this is a passive check. In this case it will report CRITICAL if no check has been submitted in the last 70 minutes.
- Whether the space on a particular disk partition on a host has dropped below a specified threshold. Typically this will report WARNING if there is less than 20% space remaining and CRITICAL if there is less than 10%.
- Whether the DNS forward and reverse records (as reported by SOWN's slave DNS server, Sown-gw) match the information stored by Icinga. For servers with SOWN IP addresses and host names it only checks these. It will report CRITICAL if either an expected forward or reverse record is missing. It will report WARNING if the forward and reverse records do not correspond. For servers without SOWN IP addresses it checks all forward and reverse records and reports CRITICAL if any are missing. In this case it does not check whether forward and reverse records correspond.
- Whether a list of the packages installed on the host has been backed up. As this is a passive check it also shows CRITICAL if a check has not been submitted within 26 hours of the previous check.
- Whether a host can be pinged on all its global IP addresses (v4 and v6) from the External Monitor server. Like PING, this shows CRITICAL if none of these addresses is pingable or there is very high packet loss or Round Trip Time (RTT). This shows WARNING if one or more (but not all) of these IP addresses are non-responsive or there is somewhat high packet loss or RTT.
- Whether the CPU load has exceeded a specific threshold on this host. The thresholds vary between servers. Typically the WARNING thresholds are 15 for 1-minute load average, 10 for 5-minute load average and 5 for 15-minute load average. The corresponding CRITICAL thresholds are 30, 25 and 20.
- Whether there is any unsent email in the subdirectories within /var/mail/. Reports CRITICAL along with the number and size of messages if there are any unsent emails.
- A dummy check that, if it can report at all, reports OK with a "NRPE is running" message. If NRPE is not running on the host it will report CRITICAL.
- Checks if there are any uninstalled packages (dpkg) waiting to be installed on the server. Generally, this will only report CRITICAL on a host if there are uninstalled security packages, but a numeric threshold can be set for this along with a WARNING threshold. This check will also report CRITICAL if it cannot interpret the output from the APT check for installed packages.
- Whether the server can be pinged on both IPv4 and IPv6 addresses on both its ECS and SOWN interfaces if applicable. This shows CRITICAL if no IP address is pingable or there is very high packet loss or Round Trip Time (RTT). This shows WARNING if one or more (but not all) IP addresses are non-responsive or there is somewhat high packet loss or RTT.
- Whether the number of processes running on the host exceeds particular thresholds. The default for CRITICAL is 300 or greater and for WARNING is 250 or greater. These will vary higher or lower depending on the specification of the host and the type of job(s) it does.
- Whether the server needs rebooting. This will report CRITICAL if package upgrades require a reboot (e.g. a new kernel, openssl or similar packages have had security updates), as it uses debian-goodies /var/run/reboot-required to check whether a reboot is required. This also reports WARNING if one of the disks is scheduled to be FSCK-ed on next reboot.
- Whether the attributes of the server (RAM, hard disk(s) size(s), CPU(s), network interfaces, OS and Linux kernel version) have been reported to the SOWN admin server recently. This is a daily passive check so reports CRITICAL if no check has been submitted in 26 hours.
- Whether the SSH service running on the server can be connected to over port 22. This shows CRITICAL if SSH cannot be connected to over any IP address (IPv4 and IPv6, ECS and SOWN interfaces). This shows WARNING if one or more (but not all) IP addresses cannot connect to SSH or the time to connect is somewhat high.
- Reports the Linux kernel currently running on the host.
- Reports the uptime of the host.
- Reports the number of users logged into a host over SSH. By default reports CRITICAL on 20 or more users and WARNING on 10 or more users.
- Whether the host has an excessive number of zombie processes running. Typically this reports CRITICAL if there are more than 10 zombie processes running and WARNING if there are more than 5.
There are a number of service checks performed on multiple servers but not on the majority of servers.
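Several of the generic checks above reduce to comparing a measured value against a WARNING and a CRITICAL threshold. The following is a minimal Python sketch of that idea, using the typical disk space and process count thresholds quoted above as defaults. The function and parameter names are hypothetical and are not taken from SOWN's actual check plugins; real thresholds vary per host.

```python
from enum import IntEnum


class State(IntEnum):
    """Service states, numbered like the usual monitoring-plugin exit codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2


def disk_state(free_percent: float, warn_below: float = 20.0,
               crit_below: float = 10.0) -> State:
    """Classify the remaining space (percent free) on one partition."""
    if free_percent < crit_below:
        return State.CRITICAL
    if free_percent < warn_below:
        return State.WARNING
    return State.OK


def process_state(process_count: int, warn_at: int = 250,
                  crit_at: int = 300) -> State:
    """Classify the number of running processes ("or greater" thresholds)."""
    if process_count >= crit_at:
        return State.CRITICAL
    if process_count >= warn_at:
        return State.WARNING
    return State.OK


if __name__ == "__main__":
    print(disk_state(15.0).name)     # WARNING: below 20% but not below 10%
    print(process_state(300).name)   # CRITICAL: 300 or greater
```

Checking the CRITICAL threshold first means a value that breaches both thresholds is always reported at the more severe level.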
- Whether backups were successfully transferred to the host that is a backup server. Reports CRITICAL if transfer was not successful or check has not reported within 26 hours.
- Whether the RADIUS authentication protocol tests for a particular domain (e.g. EAPOL-SOTON is for the soton.ac.uk domain) were successful on a particular host. Reports CRITICAL if no protocols were able to successfully authenticate. Typically reports WARNING if some but not all protocols successfully authenticated but this is configurable per host/domain pair.
- Whether a host can be accessed over HTTP from the External Monitor server. Like the HTTP services, HTTP4 and HTTP6 can be used to check specifically on IPv4 and IPv6 addresses. This returns CRITICAL if there is no response.
- Whether an HTTP response is successfully returned for this host. HTTP4 and HTTP6 can be used to check specifically on IPv4 and IPv6 addresses. If followed by a hyphen then a number, this means HTTP on a different TCP port (e.g. HTTP-8080). Sometimes the check is designed to see whether the request redirects elsewhere. Typically this returns CRITICAL if there is no response, but sometimes it will also do this if the response does not contain a specific text string, or does not redirect when a redirect is expected.
- Whether an HTTPS response is successfully returned for this host. HTTPS4 and HTTPS6 can be used to check specifically on IPv4 and IPv6 addresses. Sometimes the check is designed to see whether the request redirects elsewhere. Typically this returns CRITICAL if there is no response, but sometimes it will also do this if the response does not contain a specific text string, or does not redirect when a redirect is expected.
- Whether the HTTPS certificate has expired or is otherwise invalid. Reports CRITICAL if the certificate has expired or will expire in less than 30 days (22 days for Let's Encrypt certificates), or if the certificate is revoked or does not have a full chain.
- Whether the MySQL databases on this server have been successfully backed up. Reports CRITICAL if the backup (mysqldump) was unsuccessful or check has not reported within 26 hours.
- Whether the CPU temperature is below a particular threshold. This is only checked on physical hosts. Typically this reports CRITICAL if the CPU temperature is above 60°C and just WARNING if the CPU temperature is only over 50°C.
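The certificate check above can be illustrated with a short Python sketch. It assumes the certificate's expiry time, revocation status and chain completeness have already been determined elsewhere. The function name and its parameters are hypothetical; only the 30-day and 22-day windows come from the description above.

```python
from datetime import datetime, timedelta, timezone


def certificate_state(not_after: datetime, now: datetime,
                      lets_encrypt: bool = False, revoked: bool = False,
                      full_chain: bool = True) -> str:
    """Return "CRITICAL" or "OK" for an HTTPS certificate.

    not_after is the certificate's expiry time. An already expired
    certificate gives a negative remaining lifetime, so it is caught
    by the same comparison as one that is merely close to expiry.
    """
    if revoked or not full_chain:
        return "CRITICAL"
    # 22 days for Let's Encrypt certificates, 30 days otherwise.
    min_remaining = timedelta(days=22 if lets_encrypt else 30)
    if not_after - now < min_remaining:
        return "CRITICAL"
    return "OK"


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    expiry = now + timedelta(days=25)
    print(certificate_state(expiry, now))                     # CRITICAL
    print(certificate_state(expiry, now, lets_encrypt=True))  # OK
```

The same expiry date can therefore be CRITICAL for one certificate and OK for another, depending on which issuer's window applies.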
Some services are quite bespoke and run on just one or two hosts. These are described for the individual hosts, which include Sown-auth2, External Monitor, Sown-gw, Sown-monitor, Sown-monitor-dev and Sown-www. Some of these bespoke services are for non-SOWN hosts such as:
- Checks whether the ambient temperature of the B32 level 3 south server room (currently read from the hard disk sensor in a non-SOWN server with the hostname director.ecs.soton.ac.uk) is overly high. Reports CRITICAL when the temperature is over 50°C or just WARNING if the temperature is only over 40°C.
- Checks whether ECS's IRC server hash.ecs.soton.ac.uk has an IRC server running. Reports CRITICAL if it cannot connect to the IRC server.
- Similar to the DEBSUMS checks but only checks that the SSH binary has not changed. Reports CRITICAL if it has changed. Like some other DEBSUMS checks this is a passive check, so it will also report CRITICAL if it has failed to report in an hour, based on a Cron job that runs every 5 minutes.
- Checks whether ECS's monitoring server is monitoring Sown-auth2. This is determined by having a rule in Iptables that generates a Syslog message that SyslogNG parses to call a script that submits a passive check. If a passive check is not submitted at least every 10 minutes, then the check reports as CRITICAL.
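Many of the checks described so far are passive: the host submits results itself, and the service turns CRITICAL when no result has arrived within a given window (10 minutes, 70 minutes, 125 minutes or 26 hours in the descriptions above). The following Python sketch shows that staleness rule in isolation. It is an illustration only; how Icinga is actually configured to enforce these windows is not described on this page, and the function and parameter names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def passive_state(last_state: str, last_submitted: datetime,
                  now: datetime, max_age: timedelta) -> str:
    """Return the state to display for a passive check.

    A result older than max_age is treated as stale and overrides
    whatever state the host last reported.
    """
    if now - last_submitted > max_age:
        return "CRITICAL"
    return last_state


if __name__ == "__main__":
    now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
    window = timedelta(minutes=10)
    # Last result arrived 5 minutes ago: still fresh, so the reported state stands.
    print(passive_state("OK", now - timedelta(minutes=5), now, window))
    # Last result arrived 15 minutes ago: stale, so the service turns CRITICAL.
    print(passive_state("OK", now - timedelta(minutes=15), now, window))
```

Choosing a window noticeably longer than the reporting interval, such as an hour for a job that runs every 5 minutes, allows for some missed reports before the service alerts.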
Like servers, nodes are monitored to check that they are online and are described as either UP or DOWN. Nodes are also broken down into two main host groups: Production and Development. Beyond this, nodes can be Home nodes (connected over VPN) or they can be Native to the SOWN network. Finally, nodes can belong to the "External Build" host group if they are a non-standard build.
There are two main types of node service: generic services that are checked on all nodes, and checks that are only run on SOWN build nodes, as "External Build" nodes would not have the required functionality to pass the checks.
- Whether the usage of the node, according to RADIUS accounting, is above the usage cap specified by the node admin. Reports CRITICAL if the usage is over the cap specified on the admin site for the deployment. (A usage cap set to 0 means unlimited.)
- Whether the DNS reverse record for the node's IP is set correctly. Reports CRITICAL if the forward record (hostname) does not exist or the hostname does not return the expected IPv4 address for the node.
- Whether the node can be successfully pinged. This shows CRITICAL if no IP address is pingable or there is very high packet loss or Round Trip Time (RTT). This shows WARNING if one or more (but not all) IP addresses are non-responsive or there is somewhat high packet loss or RTT.
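Two of the rules above are easy to misread, so here is a minimal Python sketch of them: the usage cap, where 0 means unlimited, and the ping check, which aggregates results across several IP addresses. The function names are hypothetical, the packet loss and RTT thresholds are left out because no figures are given for them, and returning UNKNOWN when there are no addresses to test is an assumption rather than documented behaviour.

```python
from typing import Mapping


def usage_state(used_bytes: int, cap_bytes: int) -> str:
    """A cap of 0 means unlimited, so it can never be exceeded."""
    if cap_bytes != 0 and used_bytes > cap_bytes:
        return "CRITICAL"
    return "OK"


def ping_state(responded: Mapping[str, bool]) -> str:
    """Aggregate per-address ping results into a single service state.

    responded maps each tested IP address to True if it answered.
    """
    if not responded:
        return "UNKNOWN"  # assumption: nothing to test
    answered = sum(1 for ok in responded.values() if ok)
    if answered == 0:
        return "CRITICAL"  # no address is pingable
    if answered < len(responded):
        return "WARNING"   # some, but not all, addresses answered
    return "OK"


if __name__ == "__main__":
    # Documentation-range addresses, used purely as placeholders.
    print(ping_state({"192.0.2.10": True, "2001:db8::10": False}))  # WARNING
    print(usage_state(used_bytes=5_000, cap_bytes=0))               # OK
```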
SOWN Build Only Services
- This reports WARNING if the Crontab is not as expected. This is a passive check so reports CRITICAL if it does not report within 190 minutes. It should report every hour but as the node is connected remotely, this allows for a couple of missed reports.
- This reports WARNING if the configuration for the wireless interface(s) on the node is not as expected. This is a passive check so reports CRITICAL if it does not report within 190 minutes. It should report every hour but as the node is connected remotely, this allows for a couple of missed reports.
- Whether DHCP leases are being requested/offered by/to client devices connecting over Wi-Fi after successful 802.1X authentications. This is a passive check as it is checked as part of the SSH-NODE-PASSWORD check. This reports CRITICAL if, in the most recent hour in which the node has had successful 802.1X authentications, there are no DHCP lease requests or offers (according to syslog). This shows as UNKNOWN if there have been no successful 802.1X authentications in the last 50 hours.
- Whether there is a sufficient amount of free memory (RAM). This is checked over SNMP. It reports CRITICAL if there is less than 1 MiB free, WARNING if there is less than 2 MiB free and UNKNOWN if the node could not be connected to over SNMP.
- Whether the node can be successfully SSH-ed into. This is a passive check as it is checked as part of the SSH-NODE-PASSWORD check. This reports CRITICAL if the node cannot be SSH-ed into. This shows as UNKNOWN if it is a passive check and has not reported in the last 70 minutes.
- Checks that the node password has not been changed by comparing a hash stored in the admin system with the one in the node's /etc/shadow file. This reports CRITICAL if the hashes do not match.
- Whether syslog on the node has connected back to send its logs to the appropriate SOWN server. This checks the lsof generated list of established syslog connections on the appropriate SOWN server (via an Admin Site check script) to find any that match the node's primary IP address. This reports CRITICAL if there are no connections and WARNING if there is more than 1 connection.
- Whether the config update script has successfully run in the last hour. This is a passive check reported by the hourly cron job script on the node itself. This reports CRITICAL if no passive check has been submitted in the last 3 hours.
- Whether there are any packages that need to be updated on the node. This is checked over SNMP. This reports WARNING if any packages need upgrading and UNKNOWN if the node could not be connected to over SNMP.
- Checks how many openvpn processes are running on the node. This reports CRITICAL if there is not exactly one process running. If the check fails to run correctly it reports WARNING.
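Some of the SOWN build checks above treat "too many" as a problem as well as "too few", or report a distinct state when the measurement itself could not be taken. The Python sketch below illustrates this for the syslog connection, openvpn process and free memory checks. All function names are hypothetical, and None is used here as a stand-in for "the value could not be obtained".

```python
from typing import Optional


def syslog_state(connections: int) -> str:
    """Exactly one established syslog connection from the node is expected."""
    if connections == 0:
        return "CRITICAL"
    if connections > 1:
        return "WARNING"
    return "OK"


def openvpn_state(process_count: Optional[int]) -> str:
    """None means the check itself failed to run correctly."""
    if process_count is None:
        return "WARNING"
    return "OK" if process_count == 1 else "CRITICAL"


def memory_state(free_mib: Optional[float]) -> str:
    """None means the node could not be reached over SNMP."""
    if free_mib is None:
        return "UNKNOWN"
    if free_mib < 1:
        return "CRITICAL"
    if free_mib < 2:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    print(syslog_state(2))     # WARNING: more than one connection
    print(openvpn_state(2))    # CRITICAL: not exactly one process
    print(memory_state(None))  # UNKNOWN: no SNMP response
```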