Archive for the ‘dev’ Category

The “Cold Feet” Phase of a New Project

Sunday, April 24th, 2011

(yeah, I’m trying to start writing again, big woop, wanna fight about it?)

I was just musing on that weird early phase when you first create a new code tree and start doing minor exploratory hacking for an idea.

(musing is the perfect word for what this post is, as well.. don’t expect anything ground breaking; just me thinking out loud)

You don’t really have any complete units of functionality, so you don’t really want to commit it to a git repo (don’t want people to see the really early, stupid implementations); you don’t really want to start writing tests yet, because you know you’re going to throw most of these interfaces away an hour later anyway.

Minor aside: yes, I know, some of you write the tests before you even write the code. We disagree. I’m sorry.

You start to run the damn thing and dump out some output and start getting the results you want. You prune the code a little and get rid of some stupid glue. You move some functionality out to its own class or library or what-have-you. Things are working, your idea is starting to actually take shape.

The breaking point comes unexpectedly. You find a somewhat complicated, critical piece of logic that requires some refactoring. But now you’re worried – the code works as it is, or seems to, anyway. It’s time to write that test. It’s time to commit the existing code to a repo so that you have an audit trail of your refactor. All of a sudden, the project is starting to feel like a lot of work.

So am I wrong? Should I write the damn tests first like the TDD folks tell me?

Maybe. But I don’t think so. I think this blog post is ultimately about momentum. When you first start a fresh project based on an idea, you get on a roll quickly. You literally can’t write the code fast enough to keep all the ideas in your head. You’re cutting corners, sprinkling TODOs liberally for things that aren’t critical to just “get it working.” The problem with getting to that point where you have to start being responsible (“write ALL the tests!”) is that it saps you of that momentum.

My opinion, especially for personal projects, is that TDD robs you of that momentum before you’ve even started.

Uh oh, I’ve gone and made a statement about a religious engineering debate.

Ultimately, it comes down to commitment. Can you commit to doing the right thing, do you actually love this project that much? Is it worth spending your time polishing this thing instead of working on something else new and fun, or is it just going to end up being a polished turd?

In fact, this phase of a project is a lot like meeting someone you’re attracted to. Your gazes meet, 3 seconds pass, IT’S ON, nobody is thinking about tomorrow morning, it’s just all fun. Then a short amount of time passes and you realize, shit, I like where this is going, but how much?.. You start wondering if it’s worth passing up on repeating the previous 2 weeks with N others.. and shit, am I gonna have to shave my beard before I meet her mother? You have to decide if you’re going to commit.

(you see what I did there?)

Django Admin: Sorting on Related Object’s Property

Sunday, June 20th, 2010

This is mostly a note to myself for the future, but I’m sure someone else out there will find it helpful.

The Problem

In one of the apps I work on, I have a pretty standard setup for extending the User object in Django: I have a UserProfile model with a ForeignKey to the auth.User model and have AUTH_PROFILE_MODULE pointing at that model, so that the appropriate row gets returned when I call user.get_profile().

I’ve populated the UserProfile model with a bunch of fields that the client wanted to be able to see in the admin view. Nice and easy so far.

I was then asked to add the date the user joined the site to that view, and also make it sortable. “Easy!” I thought. I immediately went and added user__date_joined to the ModelAdmin‘s list_display. Turns out it’s not that easy! Doing this causes a 500.

At first I was baffled that such a simple relationship could not be traversed. I went into the Django source and figured out that the error was being thrown by admin.validation.validate(). I took out the check to see what would happen if the field were just allowed to be in there (I was half expecting it to just work, since putting user__date_joined in search_fields worked fine.. I thought maybe the check was accidentally too strict), and that’s when I understood why it’s not allowed: allowing a field on another model to be included in a certain model’s admin class introduces the assumption that 1. that model/field have all the methods required to be displayed and 2. those methods do what the writer of the ModelAdmin in question expects. Not a good idea.

The Solution

I brought this up in a django IRC channel, and Zain quickly suggested that I just add a callable to the ModelAdmin that simply returns the user’s date_joined. I already knew I could do that, and the reason I hadn’t was that I needed to be able to sort by that field – something that is impossible if it’s generated by a callable.

That’s when Zain pointed out admin_order_field to me – turns out you can tell the admin site to use a specific column to back sort queries against a callable. The resulting code looks like this:

class UserProfileAdmin(admin.ModelAdmin):
    list_select_related = True
    list_display = ( [...] 'date_joined')

    def date_joined(self, profile):
        return profile.user.date_joined

    date_joined.admin_order_field = 'user__date_joined'

That’s all there is to it. Enjoy!

“Fighting Spam at Flickr” at Web2.0Expo

Saturday, May 15th, 2010

I recently had the giddy honor of speaking at the 2010 Web2.0Expo in San Francisco. The topic was simple – spam. I shared some insights (or I hope they were insights, anyway) about combating the spam problem on a social website – something I had been doing quite a lot of since joining Flickr. The slides are now on Slideshare and embedded below.

Thanks to Brady and the rest of the w2e team for putting together a great conference. I didn’t get to go to as many sessions as I would have liked due to having to spend most of my time in the speakers lounge preparing, but the ones I did go to were excellent.

Things I forgot to say in the talk/slides that are important:

  • Keep track of recent rates for ALL activity that your users do. This gets a bit expensive in terms of storage, but if you prune the data furiously, it can be made sustainable. Having that information is key – it can be used at pretty much every step of spam mitigation. Also, be smart about this – if messages can be deleted from a table, don’t use that table to do the counting. Nobody I know has EVER done that……

  • Rate limit everything. There’s usually a sweet spot right between what 99% of real users will actually ever do and spam-land.
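A minimal sliding-window version of that idea can be sketched in Python (the threshold and window here are made-up numbers, and a production version would keep these counters in memcache or similar, not in-process):

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Track recent activity per user and flag anyone over the threshold."""

    def __init__(self, max_events, window_seconds):
        self.max_events = max_events
        self.window = window_seconds
        self.events = defaultdict(deque)  # user -> timestamps of recent events

    def allow(self, user, now=None):
        now = time.time() if now is None else now
        q = self.events[user]
        # prune timestamps that have fallen out of the window
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) >= self.max_events:
            return False  # over the limit: welcome to spam-land
        q.append(now)
        return True
```

Pruning on every check is what keeps the storage sustainable, per the first bullet above.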

Anyway, here are the slides. Enjoy!

“Watch out for your CRON jobs” @ MySQL Performance Blog

Monday, October 19th, 2009

I had a post brewing in my head for a couple of weeks that was going to be titled “Keeping tabs on your crontabs,” inspired by some recent margins-of-the-day stuff I’d been doing. However, Peter Z at the MySQL Performance Blog has beaten me to it. I still maintain my title idea was better :) Check it out, it has some nice, easy-to-implement tips for making sure your crons don’t do stupid things and making your ops team happy (and we’re all about that at Flickr).

Watch out for your CRON jobs

Streamstats status update

Monday, October 19th, 2009

Just wanted to post a quick note about the state of streamstats, the little tool I’ve been working on for analyzing logs/data files. Things stalled a bit when I started trying to implement time awareness, as it turned out that Python’s time parsing capabilities are limited, to put it nicely. I even tried to use regex to find a matching pattern before parsing the date, but I was unable to parse common date formats found in the logs this tool is intended to parse (namely, apache logs; and no, changing the date format for all of Flickr’s hosts is not a fucking option, ok?) This was unacceptable.

I quickly recreated the basic functionality in PHP, using the famed strtotime function. However, then I looked at the getopt() implementation available stock with PHP and realized I was either going to have to package a third party option (the pickings there were also slim), write my own lib to do it, or write a whole shitload of custom code specifically for streamstats. However, the first option was not attractive due to the fact that I’d have to create a package for it for Yahoo!’s packaging system, and the other two are unattractive because… well, I’m trying to write a fucking stats analysis function, not options handling code.

That means streamstats is being rewritten, for the 3rd time – in Perl. I’ll be using Getopt::Long and Date::Manip to keep the auxiliary logic out of the script. Luckily, the basic functionality won’t take long to recreate, and the features I’ve been trying to add shouldn’t be too bad either. Plus, I get to finally re-learn Perl.

Things to look forward to:

  • Time awareness (i.e. calculating how many times per second a certain value occurs, a distribution of frequencies, limiting the timeframe of interest)
  • proper histograms, with buckets etc
  • custom patterns (i.e., being able to specify which part of the incoming string is the time and which part is the value, for those that don’t want to use grep/awk to narrow it down beforehand)
  • multiple column comparison and relevant stats (gonna have to bust out the textbook on this one)

The next couple of months promise to be exciting in terms of shipping things, both at Flickr and outside. I’m looking forward to posting actual code for these projects.

Chiming in on Duck Typing vs Static Typing

Wednesday, September 23rd, 2009

Crowley and Malone have been going at it over duck typing vs static typing. I’ve decided to invite myself into the discussion. Crowley’s initial post is here. Malone’s response is here (read the comments as well) and Crowley’s follow up is here.

For static typing

Richard’s argument is that duck typing does not alert the developer that the wrong argument type had been supplied to a method until it is too late, if at all. His two examples are 1. passing a CreditCard to a template instead of a Kitten, resulting in the CreditCard’s number being displayed where a Kitten’s name should have shown up and 2. more interestingly, a method which does something scary up front (presumably, by ‘poke_a_tiger_with_a_stick()’ Richard means a persistent data write or a financial transaction), then iterates over one of its arguments and fails if the argument passed is not an iterable. Richard would rather know that the argument passed is not of the right type up front, so that the method would fail immediately. He suggests “turkey typing,” a C++ STL-like syntax for telling the compiler what the argument to a function should be. In his example, he uses an iterable containing only certain types of objects.

For duck typing

Mike’s argument is that duck typing provides much more flexibility than static typing – if you need an object to only fulfill one part of a certain interface, you can simply define that part of the interface, plug the object into a method that expects that specific part of the interface, and be on your merry way. Further, Malone wants exactly what Richard doesn’t – to be able to pile all sorts of different types of objects into a collection. Another important point he brings up in the comments is that duck typing allows many different developers to come up with many different solutions to a problem defined by an interface without having to share any common ground at all. As long as the same interfaces are implemented, the objects can be swapped in and out. Free-flyin polymorphism.

Does “Turkey Typing” work?

Richard’s proposal is to specify the type of argument up front – use some syntax to say that an argument should only be of a certain type or an iterable with only certain types in it, and everything is peachy: if you pass the wrong type, the compiler yells at you before you get to any logic.

However, this lacks flexibility in the same way that static typing does. It effectively encourages the creation of a whitelist for what sorts of things should be allowed into the method. Interfaces (the language constructs) can mitigate most of this, but only if the developer of the method uses an interface – extra boilerplate code.

Let’s take Flickr as an example. Initially, users could only upload photos. Using turkey typing, we’d have functions that would expect an object of type Photo. Then Flickr added video. We would have to go through all the functions that deal with generic actions – uploading, describing, deleting, etc – and edit the signature to also include Video objects. Of course, we could create a superclass, but then all of a sudden this doesn’t seem very different from the statically typed world Malone describes – lots of useless code just to be able to do simple things.

Both sides are interested in the same thing – an object’s capabilities. Malone is fine either A. assuming that people know what they’re doing or B. checking the capabilities manually. Crowley wants that check to be contained in the signature. The problem is, he wants to discover the capabilities by looking at a name. It’s equivalent to using the UserAgent string of a browser to try to figure out what code it can and can’t handle. It just doesn’t work well. If Richard’s syntax is to accomplish what it’s really after, it’ll have to become so complicated that it’s basically not worth the trouble.

The tiger argument

Let’s turn Richard’s example into something more realistic for a second. Say you have a function called purchase_items(), which takes as arguments a user object and an array of items, puts those items into a shipping queue, and charges the user’s credit card. poke_a_tiger_with_a_stick() in this case is charging the credit card, and the loop that follows is the insertion of items into a shipping queue.

Richard is right to say that if you charge the credit card, and then the rest of the function fails because some idiot dev you hired straight out of college passed something stupid instead of the array of items (maybe a single item), your customer will be pretty pissed.

The problem with this thought experiment is that it uses shitty coding as its premise. If you charge the credit card before you’re sure you’ve verified your data and successfully recorded the order, you deserve that angry support phone call. You should always save the critical transactions for last, and I know that Richard knows this. You’ll never see him poking a tiger with a stick as the first order of business. Simply restructuring the method would surface the problem with the argument immediately.
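To make the restructuring concrete, here’s a sketch in Python (names borrowed from the thought experiment; the charge function is a stub, not real payment code):

```python
def charge_credit_card(user, items):
    """Stub for the scary, irreversible step (the tiger)."""
    pass

def purchase_items(user, items):
    """Validate and queue first; poke the tiger (charge) last."""
    # fail fast on a bad argument, before any money moves
    if not isinstance(items, (list, tuple)) or not items:
        raise ValueError("items must be a non-empty list of items")
    # reversible work next: queue the shipments
    shipping_queue = [(user, item) for item in items]
    # only now, with everything verified and recorded, charge the card
    charge_credit_card(user, items)
    return shipping_queue
```

Pass a single item instead of an array and the function dies on the first line – no charge, no angry phone call, regardless of your typing discipline.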

Where I stand

I think Malone’s assessment is correct: static typing and permutations thereof ultimately aim at saving people from doing stupid things. However, there is plenty of software out there written in statically typed languages that does stupid things. You can’t help stupid. You can, however, write your code in a way that minimizes its impact, but you don’t need typing for that, just common sense.

Meanwhile, loosely typed languages allow for great flexibility and efficiency.


I think it’s worth noting that Crowley spends most of his time writing C and C++, whereas Malone and I work mostly in Python and PHP, respectively. I’m pretty sure that all of our preferences just line up with what we’re comfortable with.

Pull random data out of MySQL without making it cry – using and optimizing ORDER BY, LIMIT, OFFSET etc.

Wednesday, September 23rd, 2009


This isn’t a rocket science post. Your world view will not be changed by it. It’s just a trick I came up with while pulling random data out of a database that gives a pretty good mixture of results without straining the server.


The goals:

  • Data that is random enough that a user will be entertained for at least the first 3 or 4 reloads
  • Low cost to the database

Ways not to do it

If your immediate response to seeing this post is “DUH, use ORDER BY RAND(),” you are fired.

Step 1: random subset of non-random data

If you only need a few items (N), the easiest thing to do is to just grab N*k rows and pick randomly out of that set. In pseudo-PHP, it looks about like this:

    $n = 4; # gimme this many
    $k = 5;
    $limit = $n*$k;
    $q = "SELECT * FROM bicycle LIMIT $limit";
    $rows = mysql_give_me_the_fucking_rows($q); # AWESOME!

    for ($i=0, $random_items = array(); $i<$n; ++$i){
        $random_items[] = reset(array_splice($rows, array_rand($rows), 1));
    }

The downside here is that your overall result set is essentially always the same. We can do better, though.
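The same grab-N*k-then-sample trick, sketched in Python with the database call stubbed out (fetch_rows is a hypothetical stand-in for the query above):

```python
import random

def random_subset(fetch_rows, n, k):
    """Fetch n*k rows cheaply, then sample n of them at random."""
    rows = fetch_rows(limit=n * k)  # e.g. SELECT * FROM bicycle LIMIT n*k
    # sample without replacement; cap at however many rows came back
    return random.sample(rows, min(n, len(rows)))
```

Usage: `random_subset(lambda limit: list(range(limit)), n=4, k=5)` picks 4 distinct items out of the first 20 rows.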

Step 2: random ORDER BY criteria

Examine the following simple table from our fictional bike shop

mysql> show create table bicycle \G
*************************** 1. row ***************************
       Table: bicycle
Create Table: CREATE TABLE `bicycle` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `frame` tinyint(3) unsigned NOT NULL default '0',
  `in_stock` tinyint(3) unsigned NOT NULL default '0',
  `date_available` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY  (`id`),
  KEY `available_and_when` (`in_stock`,`date_available`)

Assuming you want to show random bicycles that are in stock, the query from above changes to

$q = "SELECT * FROM bicycle WHERE in_stock = 1 LIMIT $limit";

However, notice the available_and_when key: it includes in_stock as the leftmost column, but also date_available. Since we’re already using the leftmost column in our query, we can sort using the second column while still using the index – FO FREE*!

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5;
| id  | frame | in_stock | date_available |
... SNIP ...
5 rows in set (0.00 sec)

mysql> explain select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: bicycle
         type: ref
possible_keys: available_and_when
          key: available_and_when
      key_len: 1
          ref: const
         rows: 86900
        Extra: Using where
1 row in set (0.00 sec)

Why does this matter? It means that you can take advantage of this when grabbing your set of non-random data. You can randomly pick which way you’re going to sort the table, then grab the $n*$k rows. Our code snippet becomes

    $n = 4; # gimme this many
    $k = 5;
    $limit = $n*$k;
    $order_candidates = array('date_available', 'date_available DESC');
    $order_by = $order_candidates[array_rand($order_candidates)];
    $q = "SELECT * FROM bicycle WHERE in_stock = 1 ORDER BY $order_by LIMIT $limit";
    $rows = mysql_give_me_the_fucking_rows($q); # AWESOME!

    for ($i=0, $random_items = array(); $i<$n; ++$i){
        $random_items[] = reset(array_splice($rows, array_rand($rows), 1));
    }

Depending on your query and the indexes you have available to you, you could have numerous sorting possibilities. The more different ways you can sort, the more random your data will be. If the randomness of this data is truly important, you can add indexes to accommodate more ways of ordering.

DO NOT BE A COCK! Run every single possibility through explain. If you fuck up and order by something stupid, your DBA/Ops will murder you, and your site will be balls slow.

Step 3: Random OFFSET

This could very well have been Step 2, but I like this part less, so it goes later. This part requires a more extensive use of common sense, thus leaves more room for stupid.

Now that you’ve got some different ways you could order the data, you could also broaden your range by specifying a random offset. There are two caveats:

  1. You don’t always know how many rows you have in total, and finding out could be expensive
  2. You don’t want too high of an offset, because that will be slow

Since #2 means you should never even THINK about #1, I guess there’s really only the one caveat.

What do I mean? Let’s look at some queries against my measly 500k rows table:

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 0;
... SNIP ...
5 rows in set (0.00 sec)

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 5000;
... SNIP ...
5 rows in set (0.07 sec)

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 50000;
... SNIP ...
5 rows in set (1.15 sec)

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 100000;
... SNIP ...
5 rows in set (1.97 sec)

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 250000;
... SNIP ...
5 rows in set (4.89 sec)

The reason for the degrading performance is that MySQL has to generate the entire result set, then throw away everything below the offset. However, we can see that the performance is tolerable up to a point. We can certainly afford to pick from the first few thousand rows (or hundred, depending on your requirements).

Let’s implement the random offset:

    $n = 4; # gimme this many
    $k = 5;
    $limit = $n*$k;
    $order_candidates = array('date_available', 'date_available DESC');
    $order_rnd = array_rand($order_candidates); # separated for better cacheability
    $order_by = $order_candidates[$order_rnd];
    $offset_rnd = rand(0, 20); # separated for better cacheability
    $offset = $offset_rnd * 100;
    $q = "SELECT * FROM bicycle WHERE in_stock = 1 ORDER BY $order_by LIMIT $limit OFFSET $offset";
    $rows = mysql_give_me_the_fucking_rows($q); # AWESOME!

    for ($i=0, $random_items = array(); $i<$n; ++$i){
        $random_items[] = reset(array_splice($rows, array_rand($rows), 1));
    }

Working around OFFSET performance degradation

If you just can’t sleep well, knowing that you are not pulling rows from the entire table, you should really seek professional help – something far beyond whatever I can provide you with. Seriously, who gives a shit? However, I do have something to tide you over. Among other places, I’ve seen this workaround suggested in this MySQL Performance Blog Post. If you have a sequential column you can use (i.e., that is included in an index you can use for your query), you could use a WHERE col BETWEEN x AND y clause to simulate random offsets. Caveat: data may not be evenly distributed, so you might hit a range that has no rows in it, or not enough rows for your random sample.
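A hedged Python sketch of the BETWEEN trick (the function name and slice width are mine; in practice x and y would be interpolated into the WHERE clause):

```python
import random

def random_between_bounds(col_min, col_max, slice_width):
    """Pick a random [x, y] window between the column's min and max values.

    Caveat from the post applies: if the data isn't evenly distributed,
    the chosen window may contain few or no rows.
    """
    if col_max - col_min <= slice_width:
        return col_min, col_max
    x = random.randint(col_min, col_max - slice_width)
    # caller builds: ... AND date_available BETWEEN x AND y
    return x, x + slice_width
```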

For this scenario, MySQL does do some very handy things. If your table looks like mine, i.e. has an index on the column we’re selecting by AND the column we want to use for our offset, you can get the MIN and MAX bounds FO FREE**:

mysql> select MAX(date_available),MIN(date_available) FROM bicycle WHERE in_stock=1;
| MAX(date_available) | MIN(date_available) |
|          1253385344 |          1249496986 | 
1 row in set (0.00 sec)

mysql> explain select MAX(date_available),MIN(date_available) FROM bicycle WHERE in_stock=1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away <<< fucking sweet

Once you have the minimum and maximum timestamps, you can just pick a random range. As I said above, you could get yourself an empty result set, so be sure you only do this with uniformly distributed data and a wide enough range.

A final note: don’t forget to fucking cache

Although the entire purpose of this post is to show how to do these things in a cheap way, it’s still important to remember: MySQL time is much more expensive than memcache time. For something like this, you should definitely be caching. The strategy is very simple: generate your query params (limit, order direction, offset), then use those to build your key. Check that you don’t already have this result set in cache; if you don’t, query the database and store it in the cache using that key. The next time you get the same random criteria, you’ll be able to get the rows – FO FREE***. For our last code example, the cache key could look like this:

$key = "BicycleRandom_{$limit}_{$order_rnd}_{$offset_rnd}";
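The get-or-set pattern described above, sketched in Python with a dict standing in for memcache (the key format mirrors the PHP example; the fetch function is hypothetical):

```python
cache = {}  # stand-in for memcache

def cached_random_rows(limit, order_rnd, offset_rnd, fetch_from_db):
    """Build the key from the random query params; hit the DB only on a miss."""
    key = "BicycleRandom_%d_%d_%d" % (limit, order_rnd, offset_rnd)
    if key not in cache:
        cache[key] = fetch_from_db(limit, order_rnd, offset_rnd)
    return cache[key]
```

The second request with the same random criteria never touches MySQL.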

That’ll do it!

* By ‘FO FREE’ I mean free, aside from the cost of maintaining the index
** This time, I really did mean FREE
*** Or almost free, anyway

streamstats: easier log analysis

Saturday, August 22nd, 2009

(Since I’m tired of WordPress’s WYSIWYG editor, I’m writing my posts mostly in markdown going forward. Eventually I’ll probably switch this blog over to another platform that supports markdown editing with less pain – likely Richard Crowley’s bashpress)


After watching our Ops team try to figure out if a certain IP was malicious by piping access log output into all sorts of utilities, I decided to whip up something that would simplify that task. The first pass resulted in a 40 line script that took me a bit longer than it should have – I had forgotten statistics completely and, to my embarrassment, didn’t remember the printf syntax well enough to format a histogram.

The code can be found on github.

Basic Functionality

Regardless, the initial version of the script is quite simple: it takes stdin and generates a histogram, as well as some statistics for values in the stream. Effectively, it assumes that by the time you’re piping things into it, you’ve narrowed it down to just the piece of information you’re interested in (an IP address, an API key, etc.).

Output might look like this (fake info):

awk '{print $1}' access.log | streamstats

|                                        1
|                                        1
||||                                     4
||||                                     4
|                                        1
|                                        1
|||||                                    5
||||                                     4
|||||||||||                              11 *
||||||                                   6
|||                                      3
|                                        1

Some Statsy Things

count     12
outliers  1
maximum   11
minimum   1
stdev     2.84312035154
total     42
mean      3.5

As of this writing, I’ve added the -o option, which will limit the histogram output to show only outliers.
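The underlying computation is simple; here’s a rough Python approximation (not the script’s actual code, and the outlier logic is omitted):

```python
import math

def stream_stats(values):
    """Histogram plus the summary numbers streamstats prints."""
    mean = sum(values) / float(len(values))
    # population variance, matching the stdev shown in the fake output above
    var = sum((v - mean) ** 2 for v in values) / float(len(values))
    stats = {
        "count": len(values),
        "maximum": max(values),
        "minimum": min(values),
        "total": sum(values),
        "mean": mean,
        "stdev": math.sqrt(var),
    }
    # one row of bars per value, padded so the counts line up
    histogram = ["%-40s %d" % ("|" * v, v) for v in values]
    return stats, histogram
```

Fed the twelve values from the fake output above, this reproduces the same count, total, mean, and stdev.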

Moar! Time Awareness

The natural next step is to make streamstats time-aware. The ability to pipe an entire log through it and see statistics like requests per second per IP would greatly increase its usefulness. As it is now, a user has to first track down when a potential attack was happening, isolate just those log lines, then pipe them through. Making streamstats time-aware would allow that user to just specify a start time and an end time as parameters.

Implementation Possibilities

The question is, how to add the ability to recognize timestamps in logfiles without making things overly complicated?

Candidate #1: awk and friends

My initial thought was to allow the user to supply command sequences as arguments that would tell streamstats how to locate the timestamp and the actual value of interest. The reasoning was to take advantage of the abundance of *nix utilities already available and the likely expert knowledge of said utilities possessed by the average expected user of streamstats.

Unfortunately (or, perhaps, fortunately), this was a non-starter. Shell-level string interpolation makes this awkward to the point of uselessness, if not impossible.

Candidate #2: Regular Expressions

Another tool that I would assume most sysadmins are familiar with, if nothing else by way of grep, is regular expressions. Those are a lot easier to implement – the script can just take regex patterns as arguments. My initial thoughts were to introduce the following parameters:

  • -t / --time to let streamstats know that we’re interested in time, with an optional argument that would be the regex pattern used to find the time
  • -p / --pattern to specify the pattern used to actually parse the value out of every line of input

However, there are some problems with this. It leaves the door open for the user to specify a time pattern and not a value pattern, resulting in garbage. I could force the user to add a -p pattern if -t is specified, but it just seems unnatural.

Instead, I think it might be easier to take advantage of named groups and add the following arguments:

  • -t / --time to let streamstats know that we’re interested in time
  • -p / --pattern to specify the regular expression that will parse out both the time and the value. In the absence of -t, the default pattern would be ^(?P<value>.+)\n$ (the whole line; in fact, regular expressions will just be bypassed). In the presence of -t, the default pattern would be ^\[(?P<time>.+)\]\s*(?P<value>.+)\n$.

The advantage of this approach is that the default pattern allows the user to avoid having to specify a pattern altogether – they can just pipe the input to streamstats using awk '{printf "[%s] %s\n", $1, $2}' or something similar. However, the possibility remains to simply come up with the regex that parses out what’s needed from each line of input, thus eliminating the need to use other tools to manipulate the stream prior to streamstats. A possible downstream feature would be letting users maintain a list of frequently used patterns (probably some place like ~/.streamstats), avoiding the time usually spent conjuring up the right sequence of awk and grep.
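The named-group approach can be sketched like this (pattern taken from the proposal above, with the trailing \n handled by stripping the newline first; parsing the matched time string into an actual timestamp is left out, since that was the hard part to begin with):

```python
import re

DEFAULT_TIMED_PATTERN = r'^\[(?P<time>.+)\]\s*(?P<value>.+)$'

def parse_line(line, pattern=DEFAULT_TIMED_PATTERN):
    """Pull the time and value out of one line of input via named groups."""
    m = re.match(pattern, line.rstrip('\n'))
    if m is None:
        return None  # line didn't match; caller decides whether to complain
    groups = m.groupdict()
    return groups.get('time'), groups['value']
```

So `awk '{printf "[%s] %s\n", $1, $2}'` output works with no -p at all, and a custom -p pattern with the same group names drops in unchanged.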

I have not yet started implementing the time features, so any thoughts or suggestions would be deeply appreciated.

Django: exposing settings in templates, the easy way

Sunday, July 26th, 2009

NOTE: I have submitted a patch to have this functionality as part of Django core. I’m posting this code here in case the patch gets rejected, since similar motions have previously been rejected. The ticket is at, and I will update this post if the patch gets accepted.

I find it very handy to be able to expose parts of settings.py in templates. The Django docs show you how to do this using context processors and give the code for the MEDIA_URL context processor here. However, it seems silly to me to have to write custom processors all the time. Instead, I propose a single processor that exposes a list of settings (stored in settings.py itself) to the template layer.

The gist containing the code is here:
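A rough sketch of the idea (the TEMPLATE_VISIBLE_SETTINGS name and the explicit settings argument are my stand-ins for testability; a real Django context processor takes only request and does `from django.conf import settings`):

```python
def expose_settings(settings, request=None):
    """Return only the whitelisted settings as template context."""
    # the whitelist lives in settings itself; nothing listed means nothing leaks
    visible = getattr(settings, 'TEMPLATE_VISIBLE_SETTINGS', ())
    return dict((name, getattr(settings, name)) for name in visible)
```

One processor, one whitelist, and no SECRET_KEY accidentally wandering into a template.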


daemontools 0.76 on Ubuntu 9.04

Sunday, July 5th, 2009

I find daemontools to be incredibly useful for quickly turning scripts into daemons. I also like to run the latest version of Ubuntu, which means there is no (well-maintained) package available. There’s a package in ‘universe’ but it puts things in strange places and generally inspires distrust (explained below). As installing daemontools from source on Ubuntu hasn’t been straightforward (I had trouble both on 8.10 and 9.04), I’m gonna post a quick howto which pulls together a few pieces of information scattered throughout the internets.

Crash Course

daemontools comes with a bunch of tools. The full docs are here, but I’ll cover the basics.

  • supervise is a daemon that takes a directory as an argument and tries to execute the file named ‘run’ in that directory, while keeping state in the ‘supervise’ subdirectory. Naturally, you want to give it a folder with an executable file named ‘run’ and a writable subdirectory called ‘supervise’ in it.
  • svstat, svok, svc, etc are used to manipulate individual supervise jobs and check their status. These aren’t covered here.
  • svscan is a daemon that also takes a directory as an argument and proceeds to look for supervise-able directories within. Any time it finds one that doesn’t already have a corresponding supervise process, it fires one up. Thus if the supervise instance you count on to do some background processing crashes for whatever reason, svscan will just start it back up.
  • svscanboot is the script that should be run at boot-time to start svscan. It will direct svscan to look for jobs in the /service directory.

In an ideal world, you throw a bunch of shit into /service, svscan picks it up and fires up a supervise process for each of the subdirectories. It runs as a daemon and restarts supervise whenever needed (svscan itself is basically bulletproof in my experience). When your machine is rebooted for whatever reason, svscanboot restarts svscan and everything resumes as before. The reason I choose not to use the Ubuntu package is that it does not create this ideal world for me.
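In concrete terms, that ideal world is nothing more than a directory tree. Here's a sketch (the job names are made up, and a temp directory stands in for /service so you can poke at it without root):

```shell
# Build a miniature /service tree: one subdirectory per job, each with
# an executable 'run' file and a 'supervise' state directory.
SVCDIR=$(mktemp -d)
for job in queue-worker mail-sender; do
    mkdir -p "$SVCDIR/$job/supervise"
    printf '#!/bin/sh\nexec sleep 5\n' > "$SVCDIR/$job/run"
    chmod +x "$SVCDIR/$job/run"
done
# Pointing svscan at $SVCDIR would now spawn one supervise per job
ls "$SVCDIR"
```

Pointing svscan at /service is exactly what svscanboot does at boot.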


Installation

First of all, daemontools won't compile as-is. Follow along with the instructions here until you get to the exciting part where you type 'package/install' and then get an error message that looks like this:

/usr/bin/ld: errno: TLS definition in /lib/ section .tbss
mismatches non-TLS reference in envdir.o
/lib/ could not read symbols: Bad value
collect2: ld returned 1 exit status
make: *** [envdir] Error 1

Never fret. I have found a couple of possible fixes, but the patch below appears to be the cleanest one. [1]

diff -ur daemontools-0.76.old/src/error.h daemontools-0.76/src/error.h
--- daemontools-0.76.old/src/error.h    2001-07-12 11:49:49.000000000 -0500
+++ daemontools-0.76/src/error.h    2003-01-09 21:52:01.000000000 -0600
@@ -3,7 +3,7 @@
 #ifndef ERROR_H
 #define ERROR_H
 
-extern int errno;
+#include <errno.h>
 
 extern int error_intr;
 extern int error_nomem;

Assuming you are still at the point in the installation instructions right before the ‘package/install’ command, copy the patch into /tmp/daemontools-errno.patch, then

$ cd src
$ patch < /tmp/daemontools-errno.patch
$ cd ..

Then resume the original daemontools instructions as prescribed.


Testing

The installation creates two folders: /command and /service. The former contains all the different daemontools commands, while the latter is the default folder which the svscan daemon scans for jobs. To test that our installation worked, let's create a proper test job (you'll probably want to be superuser for this, since /service is owned by root by default).

$ cd /service
$ mkdir tester
$ mkdir tester/supervise
$ touch tester/run
$ chmod +x tester/run
In tester/run, put any script. It can be empty, but if you really want to verify things are working, make it output something to a tmp file. (Keep in mind that supervise restarts 'run' whenever it exits, so this sample will append a line roughly every five seconds.) Sample:

#!/bin/sh
echo "Foo!" >> /tmp/foo.log
sleep 5

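One refinement for real jobs: you generally want 'run' to start your process in the foreground via exec, so that supervise tracks the actual pid and signals sent with svc reach your process instead of a wrapper shell. A hypothetical run file (the /usr/local/bin/myworker path and its flag are made up):

```shell
#!/bin/sh
# 'exec' replaces this shell with the worker process, so supervise
# sees the real pid and 'svc -d' can signal it directly. Run in the
# foreground; don't let the program daemonize itself.
exec /usr/local/bin/myworker --foreground 2>&1
```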
Once everything is in place, run the following commands as superuser. If your output looks different, something's broke.

$ /command/svscan /service &
[1] 9641
$ ps -ef | grep svscan
root      9641 17108  0 18:12 pts/1    00:00:00 /command/svscan /service
mikhailp  9646 17108  0 18:12 pts/1    00:00:00 grep svscan
$ ps -ef | grep supervise
root      9642  9641  0 18:12 pts/1    00:00:00 supervise tester
mikhailp  9660 17108  0 18:12 pts/1    00:00:00 grep supervise
If you made your tester script log something, check that the output is showing up where expected. If all is well, shut 'er down and move on:

$ killall svscan
$ killall supervise

Starting at Bootup

Assuming you want to use supervise for some critical offline part of your application, you want it to start automatically whenever your server boots up. You don't want to have to manually start up svscan every time.

Ubuntu uses upstart for task management instead of sysvinit. In order to let upstart start and stop the svscan daemon correctly, the following script needs to be copied into /etc/event.d/svscan [2]

# svscan - daemontools
# This service starts daemontools from the point the system is
# started until it is shut down again.

start on startup

start on runlevel 1
start on runlevel 2
start on runlevel 3
start on runlevel 4
start on runlevel 5
start on runlevel 6

stop on shutdown

exec /command/svscanboot

Once that’s in place, you can use 'start svscan' and 'stop svscan'. Try starting it and checking the ps output as described in the Testing section. Finally, reboot the box (if possible) and repeat the same checks. At this point, svscan should start up with your system and spawn supervise processes for any jobs it finds in /service.

That’s all there is to it. Enjoy.

[1] Patch originally found here, linked from here.
[2] Corrected from this entry posted in 2006, which has some obsolete info.