spaceblog

Monitoring Cron Jobs

(This article comes from one I helped edit and publish inside work, so I can’t take any credit for the ideas expressed within, though I do vehemently and violently subscribe to the sentiment! Thanks to Alan Sundell for originally educating me.)

When you set MAILTO in a crontab fragment (or leave it unset, in which case cron mails the crontab’s owner), typically it’s because you want to be notified if your job fails. That only works if the job writes to stdout/stderr solely under exceptional conditions… and not all jobs are so disciplined: many use stderr for routine logging. Email is just not a great solution to this problem, especially at scale.

Problems with alerting by email from cron

Why is it a bad idea to rely on cron mail?

  • We all get so much mail from cron that few are in the habit of reading it anymore.
  • If your server is broken, the mail submission agent may be broken too.
  • You may handle your cron mail, but you may be on holiday when it arrives.
  • crond can crash.
  • You will get multiple emails for the same failure.
  • Your cron mail will get delivered to every one of your mailboxes, eating up storage.
  • You cannot suppress cron mail notifications.
  • Your cron mail has no concept of dependencies.
  • You will get notified of temporary failures when you only care about persistent ones.

If a cronjob running successfully is critical to operation, then what you really need is a monitoring system that addresses all of these problems, and that can send alerts to an oncall rotation so it’s always clear who is responsible for handling them.

A potential solution

Here’s an idea that might help with that.

  1. Direct the output of your job to a log file, so you have something to debug from in the event of persistent failure. Note that the single > truncates the log on each run:

    MAILTO=""
    */1 * * * * root cronjob > /var/log/cronjob.log 2>&1
    

    (If you decide to append to, rather than overwrite, the log on each execution, then make sure you logrotate that file, along the lines of the fragment below.)
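
    For the appending case, a minimal logrotate fragment might look like this (the path, retention, and file name are assumptions, adjust to taste):

    # Hypothetical /etc/logrotate.d/cronjob: keep a week of compressed
    # logs, and don't complain if the log doesn't exist yet.
    /var/log/cronjob.log {
        daily
        rotate 7
        compress
        missingok
        notifempty
    }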

  2. At the end of cronjob, update a status file, like so:

    scriptname=$(basename "$0")
    date +%s > "/var/tmp/${scriptname}-last-success-timestamp"
    

    Ensure that your job exits on error before reaching the last line!
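
    One way to ensure that, as a minimal sketch (do_the_actual_work is a placeholder for whatever your job really does), is to run the whole script under set -e:

    #!/bin/sh
    # Hypothetical skeleton for cronjob: set -e aborts on the first
    # failing command, so the timestamp below is only written after
    # everything else has succeeded.
    set -eu

    do_the_actual_work    # placeholder for the real work of the job

    # Only reached on success:
    scriptname=$(basename "$0")
    date +%s > "/var/tmp/${scriptname}-last-success-timestamp"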

  3. Collect the content of that file regularly with your monitoring system; scrape it with the nagios host agent, pump it into collectd, whatever you hip open source cats are using these days.
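
    As one sketch of the collectd route (the plugin-instance name and paths are assumptions), a feeder script for collectd’s exec plugin could report the age of the timestamp as a gauge:

    #!/bin/sh
    # Hypothetical collectd exec-plugin script: loops forever, emitting
    # how many seconds ago the cron job last succeeded.
    HOST="${COLLECTD_HOSTNAME:-$(hostname -f)}"
    INTERVAL="${COLLECTD_INTERVAL:-60}"
    STAMP=/var/tmp/cronjob-last-success-timestamp

    while :; do
        last=$(cat "$STAMP" 2>/dev/null) || last=0
        age=$(( $(date +%s) - ${last:-0} ))
        echo "PUTVAL \"$HOST/exec-cronjob/gauge-age\" interval=$INTERVAL N:$age"
        sleep "$INTERVAL"
    done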

  4. Configure your monitoring system to send a notification when the timestamp has not been updated within some time period:

    if cronjob-last-success-timestamp < (time() - 30m) then alert
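
    As a concrete version of that rule, here’s a hypothetical Nagios-style check (path and threshold assumed; exit status 2 is how Nagios plugins signal CRITICAL):

    #!/bin/sh
    # Hypothetical check: CRITICAL if the cron job hasn't succeeded
    # within the last 30 minutes.
    STAMP=/var/tmp/cronjob-last-success-timestamp
    MAX_AGE=1800    # 30 minutes, in seconds

    last=$(cat "$STAMP" 2>/dev/null) || last=0
    age=$(( $(date +%s) - ${last:-0} ))

    if [ "$age" -gt "$MAX_AGE" ]; then
        echo "CRITICAL: cronjob last succeeded ${age}s ago"
        exit 2
    fi
    echo "OK: cronjob last succeeded ${age}s ago"
    exit 0
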
  5. Profit!

Now you only generate an alert if the cron job hasn’t succeeded in the last 30 minutes (a threshold you can adjust to match your monitoring scrape intervals and service SLAs), and with a sufficiently mature monitoring system you can express dependencies, suppress the notification, route it to an oncall rotation, and so on!

Most significantly, we have converted a system that relied on reporting failures into one built around checking for success: if anything in the chain breaks, even the reporting machinery itself, the absence of a fresh success is what raises the alert. That is a failsafe.

I, for one, welcome our new enhancement proposals

I just read with excitement the announcement of Debian Enhancement Proposals, something that I too have been contemplating in recent months (though, given my ghastly lack of commitment to the Debian community, I doubted my ability to drive it).

I work in a company driven by engineering documents and designs; I like RFCs, and I like what Python has done with its PEPs. Debian’s adoption of this useful tool can only improve the community and the distribution.

Yay!

nsscache open source launch

Today we open sourced the project I’ve been working on for the last 9 months, nsscache.

It’s a glorified version of:

ldapsearch | awk > /etc/passwd

in that we, in theory, support more than just LDAP as a data source, and offer two kinds of local storage (nss_db using Berkeley DB, and plain text files).
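
Spelled out a little, that one-liner amounts to something like this (purely illustrative: the attribute handling is naive, and a real tool needs locking and atomic replacement of the target file, which is part of why you shouldn’t actually do this):

# Purely illustrative: turn LDIF posixAccount entries into passwd(5) lines.
ldapsearch -x -LLL '(objectClass=posixAccount)' \
    uid uidNumber gidNumber cn homeDirectory loginShell |
awk -F': ' '
    $1 == "uid"           { u = $2 }
    $1 == "uidNumber"     { uid = $2 }
    $1 == "gidNumber"     { gid = $2 }
    $1 == "cn"            { gecos = $2 }
    $1 == "homeDirectory" { home = $2 }
    $1 == "loginShell"    { shell = $2 }
    /^$/ { flush() }
    END  { flush() }
    function flush() {
        if (u != "")
            printf "%s:x:%s:%s:%s:%s:%s\n", u, uid, gid, gecos, home, shell
        u = ""
    }
' > passwd.new    # never write straight over /etc/passwd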

If you’re having issues with your nss_ldap setup, then try it out :)

linux.conf.au 2007 programme choices

As the countdown ticks into single digits, the organising team are ramping up. Everything’s coming along smoothly, in time for a kick-arse start on Monday.

One thing that makes me sad is that I’ll likely not be able to watch a lot of the talks – but if I do get a chance, I’d see these:

  • clustering tdb by Andrew Tridgell. When I first saw this proposal come in, I knew it was a good one. Someone should make an LDAP server that doesn’t suck, and use tdb to solve the multimaster replication problem in the storage layer. Oh wait – that’s what he’s doing already.

  • Puppet by Luke Kanies. Luke’s been working on this awesome next-generation systems configuration management tool for a few years now. He approached me, back in the day, to be a beta-tester – he and I were both hitting scaling problems with cfengine. I hope every sysadmin makes it to this talk!

There are a few others to note: Theodore Ts’o always has an interesting talk; this year he’s giving two cool subjects a run. The tutorial on heartbeat 2 by Alan Robertson is sure to be full of good load-balancing fu.

There’s lots of exciting things going on in the programme, so I hope to see you all next week!