Posts Tagged ‘mysql’

InnoDB Primary Key Clustering In Action @ Flickr

Sunday, May 8th, 2011

InnoDB clustered indexes are one of the most powerful, yet most poorly understood, optimizations.

The idea is that InnoDB will try to place items in primary key sort order together on disk. Thus, if your primary key is a numeric column called id, rows with id 1, 2, and 3 will be adjacent on disk, as will rows with id 21, 22, 23 etc. If you’re fetching rows that are adjacent when sorted on primary key, they will be adjacent on disk, resulting in sequential disk access. This sort of thinking is important when you make the decision to put on your big-boy pants and not just put your entire data set in memory.

The part that people don’t often think about is that composite primary keys follow the same rule. Tim Denike, Flickr’s DBA, demonstrates just how powerful the effect of clustered indexes can be in a photo he posted to Flickr.

The trick, as explained in the comments on the photo, is that most queries on Flickr end up being queries for multiple photos by the same owner (I suspect narcissism has a lot to do with that...). Thus, if your primary key is (owner_id, photo_id), the fact that owner_id is the prefix causes photos from the same owner to be placed together on disk.
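A minimal sketch of what such a schema might look like (a made-up table, not Flickr’s actual one):

CREATE TABLE `photo` (
  `owner_id` bigint(20) unsigned NOT NULL,
  `photo_id` bigint(20) unsigned NOT NULL,
  `title` varchar(255) NOT NULL default '',
  PRIMARY KEY  (`owner_id`,`photo_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

With that primary key, SELECT * FROM photo WHERE owner_id = 42 reads a handful of adjacent rows instead of hopping all over the disk.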

Obviously, since not all the photos are written to the database at the same time, some fragmentation is inevitable in the real world; nonetheless, it’s clearly a huge performance win. OPTIMIZE TABLE (which for InnoDB amounts to a full table rebuild) will effectively “defrag” the table, rewriting all the rows in sequential primary-key order.
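For the made-up photo table above, the “defrag” is just:

-- InnoDB doesn't support OPTIMIZE natively; MySQL recreates the table
-- instead, which rewrites the rows in primary-key order.
OPTIMIZE TABLE photo;
-- The equivalent rebuild, spelled out:
ALTER TABLE photo ENGINE=InnoDB;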

InnoDB clustering is covered in great detail in Ch. 3 of “High Performance MySQL”.

Pull random data out of MySQL without making it cry – using and optimizing ORDER BY, LIMIT, OFFSET etc.

Wednesday, September 23rd, 2009

Motivation

This isn’t a rocket science post. Your world view will not be changed by it. It’s just a trick I came up with while pulling random data out of a database; it gives a pretty good mixture of results without straining the server.

Goals

  • Data that is random enough that a user will be entertained for at least the first 3 or 4 reloads
  • Low cost to the database

Ways not to do it

If your immediate response to seeing this post is “DUH, use ORDER BY RAND(),” you are fired. ORDER BY RAND() computes a random number for every single row, dumps the lot into a temporary table, and filesorts it; on a table of any real size, that’s a massacre on every page load.
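For the morbidly curious, here’s roughly what EXPLAIN says about it on the ~500k-row table introduced below (row estimate approximate):

mysql> explain select * from bicycle ORDER BY RAND() LIMIT 5\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: bicycle
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 510240
        Extra: Using temporary; Using filesort
1 row in set (0.00 sec)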

Step 1: random subset of non-random data

If you only need a few items (N), the easiest thing to do is to just grab N*k rows and pick randomly out of that set. In pseudo-PHP, it looks about like this:

<?php
    $n = 4; # gimme this many
    $k = 5; # oversampling factor
    $limit = $n*$k;
    $q = "SELECT * FROM bicycle LIMIT $limit";
    $rows = mysql_give_me_the_fucking_rows($q); # AWESOME!

    # pull $n rows out of the oversampled set at random
    for ($i=0, $random_items = array(); $i<$n; ++$i){
        $spliced = array_splice($rows, array_rand($rows), 1);
        $random_items[] = reset($spliced); # reset() needs a real variable
    }
?>

The downside here is that your overall result set is essentially always the same. We can do better, though.

Step 2: random ORDER BY criteria

Examine the following simple table from our fictional bike shop:

mysql> show create table bicycle \G
*************************** 1. row ***************************
       Table: bicycle
Create Table: CREATE TABLE `bicycle` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `frame` tinyint(3) unsigned NOT NULL default '0',
  `in_stock` tinyint(3) unsigned NOT NULL default '0',
  `date_available` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY  (`id`),
  KEY `available_and_when` (`in_stock`,`date_available`)
) ENGINE=InnoDB AUTO_INCREMENT=510241 DEFAULT CHARSET=utf8

Assuming you want to show random bicycles that are in stock, the query from above changes to

$q = "SELECT * FROM bicycle WHERE in_stock = 1 LIMIT $limit";

However, notice the available_and_when key: it includes in_stock as the leftmost column, but also date_available. Since we’re already using the leftmost column in our query, we can sort on the second column while still using the index – FO FREE*!

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5;
+-----+-------+----------+----------------+
| id  | frame | in_stock | date_available |
+-----+-------+----------+----------------+
... SNIP ...
+-----+-------+----------+----------------+
5 rows in set (0.00 sec)

mysql> explain select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: bicycle
         type: ref
possible_keys: available_and_when
          key: available_and_when
      key_len: 1
          ref: const
         rows: 86900
        Extra: Using where
1 row in set (0.00 sec)

Why does this matter? It means that you can take advantage of this when grabbing your set of non-random data. You can randomly pick which way you’re going to sort the table, then grab the $n*$k rows. Our code snippet becomes

<?php
    $n = 4; # gimme this many
    $k = 5; # oversampling factor
    $limit = $n*$k;
    $order_candidates = array('date_available', 'date_available DESC');
    $order_by = $order_candidates[array_rand($order_candidates)];
    $q = "SELECT * FROM bicycle WHERE in_stock = 1 ORDER BY $order_by LIMIT $limit";
    $rows = mysql_give_me_the_fucking_rows($q); # AWESOME!

    for ($i=0, $random_items = array(); $i<$n; ++$i){
        $spliced = array_splice($rows, array_rand($rows), 1);
        $random_items[] = reset($spliced);
    }
?>

Depending on your query and the indexes available to you, you could have numerous sorting possibilities. The more different ways you can sort, the more random your data will be. If the randomness of this data is truly important, you can add indexes to accommodate more ways of ordering, as in the sketch below.
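For instance, if you added a hypothetical second index like KEY `stock_frame` (`in_stock`,`frame`) (it is not in the schema above), sorting by frame would be just as free, and the candidate list grows:

<?php
    # assumes a hypothetical extra index: KEY `stock_frame` (`in_stock`,`frame`)
    $order_candidates = array(
        'date_available', 'date_available DESC',
        'frame',          'frame DESC',
    );
?>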

DO NOT BE A COCK! Run every single possibility through EXPLAIN. If you fuck up and order by something stupid, your DBA/Ops will murder you, and your site will be balls slow.
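For example, on the original schema (without the hypothetical frame index from above), ordering by frame looks roughly like this – note the filesort in Extra, which is exactly the murder weapon:

mysql> explain select * from bicycle WHERE in_stock = 1 ORDER BY frame LIMIT 5\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: bicycle
         type: ref
possible_keys: available_and_when
          key: available_and_when
      key_len: 1
          ref: const
         rows: 86900
        Extra: Using where; Using filesort
1 row in set (0.00 sec)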

Step 3: Random OFFSET

This could very well have been Step 2, but I like this part less, so it goes later. It requires a more extensive use of common sense, and thus leaves more room for stupid.

Now that you’ve got some different ways you could order the data, you could also broaden your range by specifying a random offset. There are two caveats:

  1. You don’t always know how many rows you have in total, and finding out could be expensive
  2. You don’t want too high of an offset, because that will be slow

Since #2 means you should never even THINK about #1, I guess there’s really only the one caveat.

What do I mean? Let’s look at some queries against my measly 500k-row table:

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 0;
... SNIP ...
5 rows in set (0.00 sec)

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 5000;
... SNIP ...
5 rows in set (0.07 sec)

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 50000;
... SNIP ...
5 rows in set (1.15 sec)

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 100000;
... SNIP ...
5 rows in set (1.97 sec)

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 250000;
... SNIP ...
5 rows in set (4.89 sec)

The reason for the degrading performance is that MySQL can’t just jump to the offset: it has to read through and discard every row before it, so LIMIT 5 OFFSET 250000 really means generating 250,005 rows and throwing away all but the last 5. Still, the performance is tolerable up to a point; we can certainly afford to pick from the first few thousand rows (or few hundred, depending on your requirements).

Let’s implement the random offset:

<?php
    $n = 4; # gimme this many
    $k = 5; # oversampling factor
    $limit = $n*$k;
    $order_candidates = array('date_available', 'date_available DESC');
    $order_rnd = array_rand($order_candidates); # separated for better cacheability
    $order_by = $order_candidates[$order_rnd];
    $offset_rnd = rand(0, 20); # separated for better cacheability
    $offset = $offset_rnd * 100; # offsets 0-2000: cheap per the timings above
    $q = "SELECT * FROM bicycle WHERE in_stock = 1 ORDER BY $order_by LIMIT $limit OFFSET $offset";
    $rows = mysql_give_me_the_fucking_rows($q); # AWESOME!

    for ($i=0, $random_items = array(); $i<$n; ++$i){
        $spliced = array_splice($rows, array_rand($rows), 1);
        $random_items[] = reset($spliced);
    }
?>

Working around OFFSET performance degradation

If you just can’t sleep well knowing that you are not pulling rows from the entire table, you should really seek professional help – something far beyond whatever I can provide. Seriously, who gives a shit? However, I do have something to tide you over. Among other places, I’ve seen this workaround suggested in a MySQL Performance Blog post: if you have a sequential column you can use (i.e., one that is included in an index your query can use), you can add a WHERE col BETWEEN x AND y clause to simulate random offsets. Caveat: the data may not be evenly distributed, so you might hit a range that has no rows in it, or not enough rows for your random sample.

For this scenario, MySQL does do some very handy things. If your table looks like mine, i.e. has an index on the column we’re selecting by AND the column we want to use for our offset, you can get the MIN and MAX bounds FO FREE**:

mysql> select MAX(date_available),MIN(date_available) FROM bicycle WHERE in_stock=1;
+---------------------+---------------------+
| MAX(date_available) | MIN(date_available) |
+---------------------+---------------------+
|          1253385344 |          1249496986 | 
+---------------------+---------------------+
1 row in set (0.00 sec)

mysql> explain select MAX(date_available),MIN(date_available) FROM bicycle WHERE in_stock=1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away <<< fucking sweet

Once you have the minimum and maximum timestamps, you can just pick a random range. As I said above, you could get yourself an empty result set, so be sure you only do this with uniformly distributed data and a wide enough range.
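A sketch of how that might look (the one-week window is a made-up number – tune it so a window reliably holds at least $limit rows for your data):

<?php
    $min = 1249496986; # from the MIN/MAX query above
    $max = 1253385344;
    $window = 86400 * 7; # one week's worth of timestamps; tune to taste
    $start = rand($min, $max - $window); # random window within the bounds
    $end = $start + $window;
    $q = "SELECT * FROM bicycle WHERE in_stock = 1
          AND date_available BETWEEN $start AND $end
          ORDER BY date_available LIMIT $limit";
?>

If the window comes up short, widen it and retry, or just fall back to the plain OFFSET version.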

A final note: don’t forget to fucking cache

Although the entire purpose of this post is to show how to do these things in a cheap way, it’s still important to remember: MySQL time is much more expensive than memcache time. For something like this, you should definitely be caching. The strategy is very simple: generate your query params (limit, order direction, offset), then use those to build your key. Check that you don’t already have this result set in cache; if you don’t, query the database and store it in the cache using that key. The next time you get the same random criteria, you’ll be able to get the rows – FO FREE***. For our last code example, the cache key could look like this:

$key = "BicycleRandom_{$limit}_{$order_rnd}_{$offset_rnd}";
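Wired into the last example, the whole look-aside dance is only a few lines (assuming a $memcache handle from the pecl Memcached client, plus the hypothetical row-fetcher from above):

<?php
    $key = "BicycleRandom_{$limit}_{$order_rnd}_{$offset_rnd}";
    $rows = $memcache->get($key);
    if ($rows === false) { # cache miss: hit MySQL, then store for next time
        $rows = mysql_give_me_the_fucking_rows($q); # AWESOME!
        $memcache->set($key, $rows, 300); # expire after 5 minutes
    }
?>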

That’ll do it!

* By ‘FO FREE’ I mean free, aside from the cost of maintaining the index
** This time, I really did mean FREE
*** Or almost free, anyway

Drizzle: Time To Get Excited

Tuesday, March 3rd, 2009

Brian Aker was in San Francisco on Monday night to talk to the MySQL and PHP Meetup groups about Drizzle, the fork of MySQL aimed at meeting the needs of large web applications.

I had heard various tidbits here and there about Drizzle, including at MySQL2008, and had a pretty vague impression of how it was actually going to be different. The general theme seemed to be simpler, smaller, and faster.

If everything Brian described is true, it’s those things plus FUCKING AWESOME. I will join my friends (here, for example) in foaming at the mouth about it. Below is a rough list of things Brian mentioned that get me all hot ’n’ bothered about Drizzle, in no particular order:

  • No Query Cache. The best part here is the reasoning for its absence: “If you’re relying on the query cache, you probably should have just used memcached to begin with.”
  • In-Query Sharding Info. I’m not exactly sure of the details of the implementation, but it sounds like Drizzle will make it possible to route queries to the right shard automatically.
  • Pluggable Authentication. Finally. Authentication can also be completely turned off.
  • Serializable Query Plans. This feature will allow the parser to be bypassed entirely on most queries. You simply send the query to Drizzle, get the execution plan back, cache it, and send the plan back the next time you run the same query. That is Fucking Bad Ass (TM).
  • Fewer Locks. Getting rid of a lot of the more advanced and rarely used features introduced in MySQL 5/5.1 (views, stored procedures, etc.), as well as some of the more basic stuff like authentication, has allowed Drizzle to lose 2/3 of the locks present in 5.0 (at least that’s what I think Brian said… sounds surreal…) which obviously opens the door for vast improvements on multicore architectures. It also sounded like Brian made the decision during his talk to discontinue MyISAM support in Drizzle to be able to get rid of another huge lock. I’m OK with it…
  • Discontinued support for antique hardware. Lots of code ripped out because it’s no longer needed.
  • Everything is pluggable, most things are optional. The Drizzle kernel is about 115kloc. Amazing.

Brian was very coy about benchmarks, leaving it to independent sources to run them, but it sounds like Drizzle will leave MySQL in the dust for most of the common applications seen in webapps. I can’t wait to try it and to use it on a few things.