[Image: Rube Goldberg Machine Contest 2010, via Flickr]
Apparently my destiny is to wrangle cronjobs, because they are back in my life again.
I was talking about overlapping crons with a coworker, and we got into options for preventing that. I’ve long been a huge fan of flock(1) for this purpose; but since we were talking about a metrics-collecting cron that occasionally hung due to resource starvation and ended up spawning copy upon copy of itself, he suggested it might be easier to just make it time out. He talked about forking a process and having the parent time the execution and kill it. “There has to be a simpler way,” I thought.
A bit of research on a Sunday afternoon while nursing some minor whiplash revealed that there is. It also revealed that people are fucking crazy, as far as I can tell anyway.
The “state of the art”
I tweeted about my search for a utility that would wrap a process and kill it if it misbehaved. I got a bunch of responses back, including a suggestion to check out the sysadvent article about cron practices. I checked it out and found a shitload of random Ruby and shell scripts. I know we all love writing our own code to do shit, but I always prefer to have decades-old C code do my bidding for me.
Since wrangling crons using previously-invented wheels appears to be a lost art, here’s my part for bringin’ it back.
flock(1) – prevent jobs from trampling on themselves
The flock utility is available on every Linux box I’ve ever logged into, and it is mind-numbingly simple to use. It has only a few options, most of which you only need if your case is special, like if your job operates on the file you’re using for locking. I don’t believe that’s the common case; usually the lockfile simply indicates that a process is already running. Your crontab line will look roughly like this:
* * * * * /usr/bin/flock -n /tmp/lockfileforyakshaver /usr/bin/yakshaver
This will cause the job to fail if another process already holds a lock on the same file; flock exits with code 1 if the lock can’t be acquired. Alternatively, you could allow flock to wait a few seconds (let’s say 5) before giving up:
* * * * * /usr/bin/flock -w 5 /tmp/lockfileforyakshaver /usr/bin/yakshaver
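If you want to see the exit-code behavior for yourself outside of cron, a quick sketch like this will do; the lockfile path is arbitrary, and the sleeps are only there to stage the contention:

```shell
#!/bin/sh
# Sketch: demonstrate flock -n failing while another process holds the lock.
LOCK=/tmp/demo.lock

# Hold the lock in the background for a couple of seconds...
flock "$LOCK" sleep 2 &

# ...give it a moment to actually grab the lock...
sleep 1

# ...then try to take it non-blocking. This should fail immediately.
flock -n "$LOCK" true
echo "exit code: $?"   # should print 1 while the lock is held

wait
```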
That’s it. No need to write a custom script around it, no need to distribute that script to boxes and force your coworkers to read it just to audit the crontab. All they have to do is man 1 flock. And you know this code works – it’s been around for fucking ever. Even if you need some more complicated behavior, you would be better off using flock inside a shell script than writing your own implementation.
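For that “flock inside a shell script” case, the flock man page describes a form that locks an open file descriptor instead of wrapping a command, which is handy when the locked region is only part of the script. A minimal sketch of the pattern – the lockfile path and the job itself are placeholders:

```shell
#!/bin/sh
# Sketch: fd-based flock inside a script, per the flock(1) man page.
LOCKFILE=/tmp/yakshaver.lock

(
  # Try to grab the lock on fd 9 without waiting; bail if someone has it.
  flock -n 9 || { echo "another yakshaver is running, bailing" >&2; exit 1; }

  # ...the actual job goes here, holding the lock for its duration...
  echo "shaving yaks"

) 9>"$LOCKFILE"
```

The lock is released automatically when the subshell exits and fd 9 closes, so there is no stale-lockfile cleanup to worry about.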
Note: this will obviously only work for jobs that only need to lock locally. Jobs that need to share, say, a database, will require something more complex. An article I’ve linked to before discusses what you might do with a MySQL database. You can also use something like Zookeeper for this more complicated use case, but then you might have two problems. Just noticed Peter’s article references a PHP script he uses that essentially replicates the behavior of flock(1).
timeout(1) – prevent jobs from running for too long
This utility is an administrator’s dream, as far as I can see. Check out the two options it has:

-k, --kill-after=DURATION
    also send a KILL signal if COMMAND is still running this long after the initial signal was sent
-s, --signal=SIGNAL
    specify the signal to be sent on timeout. SIGNAL may be a name like ‘HUP’ or a number; see ‘kill -l’ for a list of signals
So you tell it what signal to use normally via -s (defaults to TERM), but you can then add a -k option to make sure that if it doesn’t catch your drift on the first go-around, it is swiftly KILLed. How fucking nice, right? So instead of using flock, you might time your yakshaver:
* * * * * /usr/bin/timeout -k 59s 30s /usr/bin/yakshaver
This will send the TERM signal after 30 seconds and follow up with a full-on KILL after 59 seconds, presumably just in time for the next iteration of the cronjob. Nice and easy. And it’s all just part of Linux, ain’t that fuckin’ great?
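If you want to convince yourself of the kill behavior before trusting a cronjob to it, here’s a quick sketch; sleep 10 stands in for a wedged yakshaver, and the 1-second limit is just to keep the demo fast. Coreutils timeout exits with 124 when it had to step in and kill the command:

```shell
#!/bin/sh
# Sketch: watch timeout(1) kill a job that runs too long.
# 'sleep 10' stands in for a hung process; 1s keeps the demo quick.
timeout 1s sleep 10
echo "exit code: $?"   # coreutils timeout exits 124 when the command timed out
```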
But I like ruby!
Yeah, that’s fine. However, these are succinct, declarative, straightforward options that require minimal cognitive overhead to understand and debug. There is enormous value in that. They are also part of what one might call a canon – a long tradition of simple, short utilities that perform one specific task exceptionally well. While it’s en vogue these days to throw out “the old ways” in favor of something shiny, we’re starting to see that maybe that’s not the best way to go. Simple is good, especially when the requirement is also simple.
Small aside: solving the problem from the other end
I realized as I was re-reading this post just before hitting “publish” that I would be remiss if I didn’t mention another solution we’re considering to the problem of crons being unable to run to completion: a company called Librato, which is run by some very smart friends of mine, offers a service called “Silverline,” which wraps processes in containers that ensure sufficient resources are always available. Think of it as a rigid allowance for the observer effect. It does quite a few other things, but I’ll let you make up your own mind, lest I start gushing: https://silverline.librato.com/.