For our Linux systems we were able to add a very simple process to cron to touch a file every 5 minutes.
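The original cron job is not shown, so here is a minimal sketch of what such a job could run, written in Python. The file name `heartbeat`, the script path, and the directory in the sample crontab line are all hypothetical.

```python
#!/usr/bin/env python3
"""Touch a heartbeat file so monitoring can see this host is alive.

Hypothetical crontab entry (paths are assumptions, not from the original):
    */5 * * * * /usr/bin/python3 /usr/local/bin/heartbeat.py /mnt/heartbeat/myhost
"""
import sys
from pathlib import Path


def touch_heartbeat(directory: str) -> Path:
    """Create the heartbeat file if needed and set its mtime to now."""
    # The directory is deliberately not created here: if the NFS mount is
    # missing, failing is the desired outcome, because the stale or absent
    # file is exactly what the monitoring side is watching for.
    path = Path(directory) / "heartbeat"  # hypothetical file name
    path.touch(exist_ok=True)
    return path


if __name__ == "__main__" and len(sys.argv) == 2:
    touch_heartbeat(sys.argv[1])
```

In practice a one-line `touch` in the crontab does the same job; the function form is only here to make the behavior explicit.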

The file lives on an NFS-mounted location, and we pointed our monitoring tools at that directory to look for the oldest file in it. The only thing in the directory is this ‘heartbeat’ file. Once it is more than 5 minutes old, the status turns yellow. Once it is more than 10 minutes old, it turns red. At that point we know that either the system is having issues or it is at least no longer connected to the NFS system.
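The check itself is a feature of whatever monitoring tool is in use, but the logic can be sketched in a few lines of Python. The 5 and 10 minute thresholds come from the description above. The function names, the "green" label for a healthy file, and treating an empty directory as red are my assumptions.

```python
import time
from pathlib import Path
from typing import Optional

YELLOW_AFTER_SECONDS = 5 * 60   # older than 5 minutes: yellow
RED_AFTER_SECONDS = 10 * 60     # older than 10 minutes: red


def status_for_age(age_seconds: float) -> str:
    """Map the age of the heartbeat file to a status color."""
    if age_seconds > RED_AFTER_SECONDS:
        return "red"
    if age_seconds > YELLOW_AFTER_SECONDS:
        return "yellow"
    return "green"  # assumed label for a fresh heartbeat


def oldest_file_age(directory: str, now: Optional[float] = None) -> Optional[float]:
    """Return the age in seconds of the oldest file, or None if there is none."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(directory).iterdir() if p.is_file()]
    if not mtimes:
        return None
    return now - min(mtimes)


def heartbeat_status(directory: str, now: Optional[float] = None) -> str:
    """Status for one system's heartbeat directory."""
    age = oldest_file_age(directory, now)
    # Assumption: a directory with no heartbeat file at all counts as red.
    return "red" if age is None else status_for_age(age)
```

Because cron touches the file every 5 minutes, a healthy file's age hovers just under the yellow threshold, so a brief yellow on a busy system is not necessarily a failure.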

When this was requested, I wondered what the big deal was with creating this process. But when there are hundreds of systems, it turns into the early warning signal that things are going bump in the night.

If there is just a system or two in this state, then it is no big deal. Check whether their files are still being touched and whether there is a reason they are no longer updating. But when lots of systems are in this state, you know there is a global problem, and that requires a different approach.
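One way to picture the "system or two" versus "lots of systems" distinction is a small Python sketch that counts the systems that are not healthy. The cutoff of 2, the function name, and the "ok", "isolated", and "global" labels are purely illustrative assumptions; the right cutoff depends on the size of the fleet.

```python
from typing import List, Mapping, Tuple


def assess_fleet(statuses: Mapping[str, str],
                 isolated_limit: int = 2) -> Tuple[str, List[str]]:
    """Decide whether stale heartbeats look like a local or a global problem.

    statuses maps a hostname to "green", "yellow", or "red".
    isolated_limit is a hypothetical cutoff: up to this many stale systems
    is treated as an isolated issue, anything above it as a global one.
    """
    stale = sorted(host for host, status in statuses.items() if status != "green")
    if not stale:
        scope = "ok"
    elif len(stale) <= isolated_limit:
        scope = "isolated"
    else:
        scope = "global"
    return scope, stale


# Example: one stale host out of three is an isolated problem.
scope, stale_hosts = assess_fleet({"web01": "green", "web02": "red", "db01": "green"})
```

An "isolated" result points at checking the individual hosts, while "global" suggests looking at something shared first, such as the NFS server or the network.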

Very simple to put into place.  Very simple to use.  Lots of value add.

This is different from things like pinging a list of systems to see if they are still up. This uses those systems to do the work. It does not run often enough to cause performance problems, but it runs often enough to catch issues.