Archive for September, 2009

Chiming in on Duck Typing vs Static Typing

Wednesday, September 23rd, 2009

Crowley and Malone have been going at it over duck typing vs static typing. I’ve decided to invite myself into the discussion. Crowley’s initial post is here. Malone’s response is here (read the comments as well) and Crowley’s follow up is here.

For static typing

Richard’s argument is that duck typing does not alert the developer that the wrong argument type had been supplied to a method until it is too late, if at all. His two examples are 1. passing a CreditCard to a template instead of a Kitten, resulting in the CreditCard’s number being displayed where a Kitten’s name should have shown up and 2. more interestingly, a method which does something scary up front (presumably, by ‘poke_a_tiger_with_a_stick()’ Richard means a persistent data write or a financial transaction), then iterates over one of its arguments and fails if the argument passed is not an iterable. Richard would rather know that the argument passed is not of the right type up front, so that the method would fail immediately. He suggests “turkey typing,” a C++ STL-like syntax for telling the compiler what the argument to a function should be. In his example, he uses an iterable containing only certain types of objects.

For duck typing

Mike’s argument is that duck typing provides much more flexibility than static typing – if you need an object to only fullfill one part of a certain interface, you can simply define that part of the interface, plug the object into a method that expects that specific part of the interface, and be on your merry way. Further, Malone wants exactly what Richard doesn’t – to be able to pile all sorts of different types of objects into a collection. Another important point he brings up in the comments is that duck typing allows many different developers to come up with many different solutions to a problem defined by an interface without having to share any common ground at all. As long as the same interfaces are implemented, the objects can be swapped in and out. Free-flyin polymorphism.

Does “Turkey Typing” work?

Richard’s proposal is to specify the type of argument up front – use some syntax to say that an argument should only be of a certain type or an iterable with only certain types in it, and everything is peachy: if you pass the wrong type, the compiler yells at you before you get to any logic.

However, this lacks flexibility in the same way that static typing does. It effectively encourages the creation of a whitelist for what sorts of things should be allowed into the method. Interfaces (the language constructs) can mitigate most of this, but only if the developer of the method uses an interface – extra boilerplate code.

Let’s take Flickr as an example. Initially, users could only upload photos. Using turkey typing, we’d have functions that would expect an object of type Photo. Then Flickr added video. We would have to go through all the functions that deal with generic actions – uploading, describing, deleting, etc – and edit the signature to also include Video objects. Of course, we could create a superclass, but then all of a sudden this doesn’t seem very different from the statically typed world Malone describes – lots of useless code just to be able to do simple things.

Both sides are interested in the same thing – an object’s capabilities. Malone is fine either A. assuming that people know what they’re doing or B. checking the capabilities manually. Crowley wants that check to be contained in the signature. The problem is, he wants to discover the capabilities by looking at a name. It’s equivalent to using the UserAgent string of a browser to try to figure out what code it can and can’t handle. It just doesn’t go well. If Richard’s syntax is to accomplish what it’s really after, it’ll have to become so complicated, that it’s basically not worth the trouble.

The tiger argument

Let’s turn Richard’s example into something more realistic for a second. Say you have a function called purchase_items(), which takes as arguments a user object an array of items, puts those items into a shipping queue, and charges the user’s credit card. poke_a_tiger_with_a_stick() in this case is charging the credit card, and the loop that follows is the insertion of items into a shipping queue.

Richard is right to say that if you charge the credit card, and then the rest of the function fails because some idiot dev you hired straight out of college passed something stupid instead of the array of items (maybe a single item), your customer will be pretty pissed.

The problem with this thought experiment is that it uses shitty coding as its premise. If you charge the credit card before you’re sure you’ve verified your data and successfully recorded the order, you deserve that angry support phonecall. You should always save the critical transactions for for last, and I know that Richard knows this. You’ll never see him poking a tiger with a stick as the first order of business. Simply restructuring the method would surface the problem with the argument immediately.

Where I stand

I think Malone’s assessment is correct: static typing and permutations thereof ultimately aim at saving people from doing stupid things. However, there is plenty of software out there written in statically typed languages that does stupid things. You can’t help stupid. You can, however, write your code in a way that minimizes its impact, but you don’t need typing for that, just common sense.

Meanwhile, loosely typed languages allow allow for great flexibility and efficiency.

Biases

I think it’s worth noting that Crowley spends most of his time writing C and C++, whereas Malone and I work mostly in Python and PHP, respectively. I’m pretty sure that all of our preferences just line up with what we’re comfortable with.

Pull random data out of MySQL without making it cry – using and optimizing ORDER BY, LIMIT, OFFSET etc.

Wednesday, September 23rd, 2009

Motivation

This isn’t a rocket science post. Your world view will not be changed by it. It’s just a trick I came up with while pulling random data out of a database that gives a pretty good mixture of results without straining the server.

Goals

  • Data that is random enough that a user will be entertained for at least the first 3 or 4 reloads
  • Low cost to the database

Ways not to do it

If your immediate response to seeing this post is “DUH, use ORDER BY RND(),” you are fired.

Step 1: random subset of non-random data

If you only need a few items (N), the easiest thing to do is to just grab N*k rows and pick randomly out of that set. In pseudo-PHP, it looks about like this:

<?php
    $n = 4; # gimme this many
    $k = 5;
    $limit = $n*$k;
    $q = "SELECT * FROM bicycles LIMIT $limit";
    $rows = mysql_give_me_the_fucking_rows($q); # AWESOME!

    for ($i=0, $random_items = array(); $i<$n; ++$i){
        $random_items[] = reset(array_splice($rows, array_rand($rows), 1));
    }
?>

The downside here is that your overall result set is essentially always the same. We can do better, though.

Step 2: random ORDER BY criteria

Examine the following simple table from our fictional bike shop

mysql> show create table bicycle \G
*************************** 1. row ***************************
       Table: bicycle
Create Table: CREATE TABLE `bicycle` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `frame` tinyint(3) unsigned NOT NULL default '0',
  `in_stock` tinyint(3) unsigned NOT NULL default '0',
  `date_available` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY  (`id`),
  KEY `available_and_when` (`in_stock`,`date_available`)
) ENGINE=InnoDB AUTO_INCREMENT=510241 DEFAULT CHARSET=utf8

Assuming you want to show random bicycles that are in stock, the query from above changes to

$q = "SELECT * FROM bicycles WHERE in_stock = 1 LIMIT $limit";

However, notice the available_and_when key: it includes in_stock as the left most column, but also date_available. Since we’re already using the leftmost column in our query, we can sort using the second column while still using the index – FO FREE*!

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5;
+-----+-------+----------+----------------+
| id  | frame | in_stock | date_available |
+-----+-------+----------+----------------+
... SNIP ...
+-----+-------+----------+----------------+
5 rows in set (0.00 sec)

mysql> explain select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: bicycle
         type: ref
possible_keys: available_and_when
          key: available_and_when
      key_len: 1
          ref: const
         rows: 86900
        Extra: Using where
1 row in set (0.00 sec)

Why does this matter? It means that you can take advantage of this when grabbing your set of non-random data. You can randomly pick which way you’re going to sort the table, then grab the $n*$k rows. Our code snippet becomes

<?php
    $n = 4; # gimme this many
    $k = 5;
    $limit = $n*$k;
    $order_candidates = array('date_available', 'date_available DESC');
    $order_by = $order_candidates[array_rand($order_candidates)];
    $q = "SELECT * FROM bicycles WHERE in_stock = 1 $order_by LIMIT $limit";
    $rows = mysql_give_me_the_fucking_rows($q); # AWESOME!

    for ($i=0, $random_items = array(); $i<$n; ++$i){
        $random_items[] = reset(array_splice($rows, array_rand($rows), 1));
    }
?>

Depending on your query and the indexes you have available to you, you could have numerous sorting possibilities. The more different ways you can sort, the more random your data will be. If the randomness of this data is truly important, you can add indexes to accommodate more ways of ordering.

DO NOT BE A COCK! Run every single possibility through explain. If you fuck up and order by something stupid, your DBA/Ops will murder you, and your site will be balls slow.

Step 3: Random OFFSET

This could very well have been Step 2, but I like this part less, so it goes later. This part requires a more extensive use of common sense, thus leaves more room for stupid.

Now that you’ve got some different ways you could order the data, you could also broaden your range by specifying a random offset. There are two caveats:

  1. You don’t always know how many rows you have in total, and finding out could be expensive
  2. You don’t want too high of an offset, because that will be slow

Since #2 means you should never even THINK about #1, I guess there’s really only the one caveat.

What do I mean? Let’s look at some queries against my measly 500k rows table:

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 0;
... SNIP ...
5 rows in set (0.00 sec)

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 5000;
... SNIP ...
5 rows in set (0.07 sec)

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 50000;
... SNIP ...
5 rows in set (1.15 sec)

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 100000;
... SNIP ...
5 rows in set (1.97 sec)

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 250000;
... SNIP ...
5 rows in set (4.89 sec)

The reason for the degrading performance is that MySQL has to generate the entire result set, then throw away everything below the offset. However, we can see that the performance is tolerable up to a point. We can certainly afford to pick from the first few thousand rows (or hundred, depending on your requirements).

Let’s implement the random offset:

<?php
    $n = 4; # gimme this many
    $k = 5;
    $limit = $n*$k;
    $order_candidates = array('date_available', 'date_available DESC');
    $order_rnd = array_rand($order_candidates); # separated for better cacheability
    $order_by = $order_candidates[$order_rnd];
    $offset_rnd = rand(0, 20); # separated for better cacheability
    $offset = $offset_rnd * 100;
    $q = "SELECT * FROM bicycles WHERE in_stock = 1 $order_by LIMIT $limit OFFSET $offset";
    $rows = mysql_give_me_the_fucking_rows($q); # AWESOME!

    for ($i=0, $random_items = array(); $i<$n; ++$i){
        $random_items[] = reset(array_splice($rows, array_rand($rows), 1));
    }
?>

Working around OFFSET performance degradation

If you just can’t sleep well, knowing that you are not pulling rows from the entire table, you should really seek professional help – something far beyond whatever I can provide you with. Seriously, who gives a shit? However, I do have something to tide you over. Among other places, I’ve seen this workaround suggested in this MySQL Performance Blog Post. If you have a sequential column you can use (i.e., that is included in an index you can use for your query), you could use a WHERE col BETWEEN x AND y clause to simulate random offsets. Caveat: data may not be evenly distributed, so you might hit a range that has no rows in it, or not enough rows for your random sample.

For this scenario, MySQL does do some very handy things. If your table looks like mine, i.e. has an index on the column we’re selecting by AND the column we want to use for our offset, you can get the MIN and MAX bounds FO FREE**:

mysql> select MAX(date_available),MIN(date_available) FROM bicycle WHERE in_stock=1;
+---------------------+---------------------+
| MAX(date_available) | MIN(date_available) |
+---------------------+---------------------+
|          1253385344 |          1249496986 | 
+---------------------+---------------------+
1 row in set (0.00 sec)

mysql> explain select MAX(date_available),MIN(date_available) FROM bicycle WHERE in_stock=1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away <<< fucking sweet

Once you have the minimum and maximum timestamps, you can just pick a random range. As I said above, you could get yourself an empty result set, so be sure you only do this with uniformly distributed data and a wide enough range.

A final note: don’t forget to fucking cache

Although the entire purpose of this post is to show how to do these things in a cheap way, it’s still important to remember: MySQL time is much more expensive than memcache time. For something like this, you should definitely be caching. The strategy is very simple: generate your query params (limit, order direction, offset), then use those to build your key. Check that you don’t already have this result set in cache; if you don’t, query the database and store it in the cache using that key. The next time you get the same random criteria, you’ll be able to get the rows – FO FREE***. For our last code example, the cache key could look like this:

$key = "BicycleRandom_$limit_{$order_rnd}_{$offset_rnd}";

That’ll do it!

* By ‘FO FREE’ I mean free, aside from the cost of maintaining the index
** This time, I really did mean FREE
*** Or almost free, anyway