Monitoring Cron Jobs

(This article comes from one I helped edit and publish inside work, so I can’t take any credit for the ideas expressed within, though I do vehemently and violently subscribe to the sentiment! Thanks to Alan Sundell for originally educating me.)

When you set MAILTO in a crontab fragment (or leave it unset, so cron mails the crontab's owner), it's typically because you want to be notified if your job fails. That only works if the job prints to stdout/stderr exclusively when something exceptional happens. However, not all jobs behave that way: many use stderr for routine logging, and email is just not a great solution to this problem, especially at scale.
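
For reference, this is the pattern in question, sketched with made-up names (the address and job name are placeholders): anything the job writes to stdout or stderr gets mailed to the MAILTO address.

    MAILTO="oncall@example.com"
    # Anything this job prints to stdout or stderr is mailed to MAILTO.
    */15 * * * * root some-batch-job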

Problems with alerting by email from cron

Why is it a bad idea to rely on cron mail?

  • We all get so much mail from cron that few of us are in the habit of reading it anymore.
  • If your server is broken, the mail submission agent may be broken too.
  • You may handle your cron mail, but you may be on holiday when it arrives.
  • crond can crash.
  • You will get multiple emails for the same failure.
  • Your cron mail will get delivered to every one of your mailboxes, eating up storage.
  • You cannot temporarily suppress cron mail notifications (during planned maintenance, for example).
  • Your cron mail has no concept of dependencies.
  • You will get notified of temporary failures when you only care about persistent ones.

If a cron job running successfully is critical to operation, what you really need is a monitoring system that addresses all of these things and can route alerts to an oncall rotation, so it is always clear who is responsible for handling them.

A potential solution

Here’s an idea that might help with that.

  1. Direct the output of your job to a log file, so there is something to debug from in the event of a persistent failure. Note that the redirection truncates the log on each run:

    MAILTO=""
    */1 * * * * root cronjob > /var/log/cronjob.log 2>&1
    

    (If you decide to append to the log on each run rather than overwrite it, make sure you logrotate that file; see the sketch after this list.)

  2. At the end of cronjob, update a status file, like so:

    # Derive the status-file name from the script's own name ("cronjob" here).
    scriptname=$(basename "$0")
    # Record the time of the last successful run as a Unix timestamp.
    date +%s > "/var/tmp/${scriptname}-last-success-timestamp"
    

    Ensure that your job exits on error before reaching the last line! (See the script sketch after this list for one way to arrange that.)

  3. Collect the content of that file regularly with your monitoring system: scrape it with the Nagios host agent, pump it into collectd, or whatever you hip open source cats are using these days.

  4. Configure your monitoring system to send a notification when the timestamp has not been updated within some time period, conceptually something like this (a concrete check-script sketch follows the list):

    if cronjob-last-success-timestamp < (time() - 30m)
      then alert
    
  5. Profit!
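
For step 1, if you go the append route, a logrotate snippet along these lines keeps that log from growing without bound. This is a minimal sketch: the path matches the example above, and the weekly/four-rotation schedule is only a suggestion.

    /var/log/cronjob.log {
        weekly
        rotate 4
        compress
        missingok
        notifempty
    }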
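
For step 2, here is a minimal sketch of what the cronjob script itself might look like; do_the_actual_work is a placeholder for whatever your job really does. The set -e makes the script abort on the first failing command, so the timestamp at the bottom is only written after a fully successful run:

    #!/bin/sh
    # Abort immediately if any command fails, so the timestamp update
    # below is never reached on error.
    set -e

    scriptname=$(basename "$0")

    # ... the real work of the job goes here ...
    do_the_actual_work

    # Only reached if everything above succeeded.
    date +%s > "/var/tmp/${scriptname}-last-success-timestamp"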
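
For steps 3 and 4, a check along these lines could be run by the Nagios host agent (or adapted for whatever system you do use). It is a sketch, not a finished plugin: the path matches the example above, the 30-minute threshold is the same as in the rule, and the exit codes follow the usual Nagios convention of 0 for OK and 2 for CRITICAL.

    #!/bin/sh
    # Alert if the cron job has not succeeded within the last 30 minutes.
    STAMP_FILE=/var/tmp/cronjob-last-success-timestamp
    MAX_AGE=1800  # seconds

    now=$(date +%s)
    # Treat a missing or unreadable timestamp file as "never succeeded".
    last=$(cat "$STAMP_FILE" 2>/dev/null || echo 0)
    age=$((now - last))

    if [ "$age" -gt "$MAX_AGE" ]; then
        echo "CRITICAL: last success ${age}s ago"
        exit 2
    fi

    echo "OK: last success ${age}s ago"
    exit 0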

Now you only generate an alert if the cron job hasn’t succeeded in the last 30 minutes (a threshold you can adjust to match your monitoring scrape intervals and service SLAs), and with a sufficiently mature monitoring system you can express dependencies, suppress notifications when you need to, route them to an oncall rotation, and so on!

Most significantly, we have converted a system that reported failures (and stayed silent when it was itself broken) into a system that is built around checking for success: a failsafe.