Streamstats status update

October 19th, 2009

Just wanted to post a quick note about the state of streamstats, the little tool I’ve been working on for analyzing logs/data files. Things stalled a bit when I started trying to implement time awareness, as it turned out that Python’s time parsing capabilities are limited, to put it nicely. I even tried to use regex to find a matching pattern before parsing the date, but I was unable to parse common date formats found in the logs this tool is intended to parse (namely, apache logs; and no, changing the date format for all of Flickr’s hosts is not a fucking option, ok?) This was unacceptable.

I quickly recreated the basic functionality in PHP, using the famed strtotime function. However, then I looked at the getopt() implementation available stock with PHP and realized I was either going to have to package a third party option (the pickings there were also slim), write my own lib to do it, or write a whole shitload of custom code specifically for streamstats. However, the first option was not attractive due to the fact that I’d have to create a package for it for Yahoo!’s packaging system, and the other two are unattractive because… well, I’m trying to write a fucking stats analysis function, not options handling code.

That means streamstats is being rewritten, for the 3rd time. in Perl. I’ll be using Getopt::Long and Date::Manip to keep the auxiliary logic out of the script. Luckily, the basic functionality won’t take long to recreate, and the features I’ve been trying to add shouldn’t be too bad either. Plus, I get to finally re-learn Perl.

Things to look forward to:

  • Time awareness (i.e. calculating how many times per second a certain value occurs, a distribution of frequencies, limiting the timeframe of interest)
  • proper histograms, with buckets etc
  • custom patterns (i.e., being able to specify which part of the incoming string is the time and which part is the value, for those that don’t want to use grep/awk to narrow it down beforehand
  • multiple column comparison and relevant stats (gonna have to bust out the textbook on this one)

The next couple of months promise to be exciting in terms of shipping things, both at Flickr and outside. I’m looking forward to posting actual code for streamstats.pl.

Chiming in on Duck Typing vs Static Typing

September 23rd, 2009

Crowley and Malone have been going at it over duck typing vs static typing. I’ve decided to invite myself into the discussion. Crowley’s initial post is here. Malone’s response is here (read the comments as well) and Crowley’s follow up is here.

For static typing

Richard’s argument is that duck typing does not alert the developer that the wrong argument type had been supplied to a method until it is too late, if at all. His two examples are 1. passing a CreditCard to a template instead of a Kitten, resulting in the CreditCard’s number being displayed where a Kitten’s name should have shown up and 2. more interestingly, a method which does something scary up front (presumably, by ‘poke_a_tiger_with_a_stick()’ Richard means a persistent data write or a financial transaction), then iterates over one of its arguments and fails if the argument passed is not an iterable. Richard would rather know that the argument passed is not of the right type up front, so that the method would fail immediately. He suggests “turkey typing,” a C++ STL-like syntax for telling the compiler what the argument to a function should be. In his example, he uses an iterable containing only certain types of objects.

For duck typing

Mike’s argument is that duck typing provides much more flexibility than static typing – if you need an object to only fullfill one part of a certain interface, you can simply define that part of the interface, plug the object into a method that expects that specific part of the interface, and be on your merry way. Further, Malone wants exactly what Richard doesn’t – to be able to pile all sorts of different types of objects into a collection. Another important point he brings up in the comments is that duck typing allows many different developers to come up with many different solutions to a problem defined by an interface without having to share any common ground at all. As long as the same interfaces are implemented, the objects can be swapped in and out. Free-flyin polymorphism.

Does “Turkey Typing” work?

Richard’s proposal is to specify the type of argument up front – use some syntax to say that an argument should only be of a certain type or an iterable with only certain types in it, and everything is peachy: if you pass the wrong type, the compiler yells at you before you get to any logic.

However, this lacks flexibility in the same way that static typing does. It effectively encourages the creation of a whitelist for what sorts of things should be allowed into the method. Interfaces (the language constructs) can mitigate most of this, but only if the developer of the method uses an interface – extra boilerplate code.

Let’s take Flickr as an example. Initially, users could only upload photos. Using turkey typing, we’d have functions that would expect an object of type Photo. Then Flickr added video. We would have to go through all the functions that deal with generic actions – uploading, describing, deleting, etc – and edit the signature to also include Video objects. Of course, we could create a superclass, but then all of a sudden this doesn’t seem very different from the statically typed world Malone describes – lots of useless code just to be able to do simple things.

Both sides are interested in the same thing – an object’s capabilities. Malone is fine either A. assuming that people know what they’re doing or B. checking the capabilities manually. Crowley wants that check to be contained in the signature. The problem is, he wants to discover the capabilities by looking at a name. It’s equivalent to using the UserAgent string of a browser to try to figure out what code it can and can’t handle. It just doesn’t go well. If Richard’s syntax is to accomplish what it’s really after, it’ll have to become so complicated, that it’s basically not worth the trouble.

The tiger argument

Let’s turn Richard’s example into something more realistic for a second. Say you have a function called purchase_items(), which takes as arguments a user object an array of items, puts those items into a shipping queue, and charges the user’s credit card. poke_a_tiger_with_a_stick() in this case is charging the credit card, and the loop that follows is the insertion of items into a shipping queue.

Richard is right to say that if you charge the credit card, and then the rest of the function fails because some idiot dev you hired straight out of college passed something stupid instead of the array of items (maybe a single item), your customer will be pretty pissed.

The problem with this thought experiment is that it uses shitty coding as its premise. If you charge the credit card before you’re sure you’ve verified your data and successfully recorded the order, you deserve that angry support phonecall. You should always save the critical transactions for for last, and I know that Richard knows this. You’ll never see him poking a tiger with a stick as the first order of business. Simply restructuring the method would surface the problem with the argument immediately.

Where I stand

I think Malone’s assessment is correct: static typing and permutations thereof ultimately aim at saving people from doing stupid things. However, there is plenty of software out there written in statically typed languages that does stupid things. You can’t help stupid. You can, however, write your code in a way that minimizes its impact, but you don’t need typing for that, just common sense.

Meanwhile, loosely typed languages allow allow for great flexibility and efficiency.

Biases

I think it’s worth noting that Crowley spends most of his time writing C and C++, whereas Malone and I work mostly in Python and PHP, respectively. I’m pretty sure that all of our preferences just line up with what we’re comfortable with.

Pull random data out of MySQL without making it cry – using and optimizing ORDER BY, LIMIT, OFFSET etc.

September 23rd, 2009

Motivation

This isn’t a rocket science post. Your world view will not be changed by it. It’s just a trick I came up with while pulling random data out of a database that gives a pretty good mixture of results without straining the server.

Goals

  • Data that is random enough that a user will be entertained for at least the first 3 or 4 reloads
  • Low cost to the database

Ways not to do it

If your immediate response to seeing this post is “DUH, use ORDER BY RND(),” you are fired.

Step 1: random subset of non-random data

If you only need a few items (N), the easiest thing to do is to just grab N*k rows and pick randomly out of that set. In pseudo-PHP, it looks about like this:

<?php
    $n = 4; # gimme this many
    $k = 5;
    $limit = $n*$k;
    $q = "SELECT * FROM bicycles LIMIT $limit";
    $rows = mysql_give_me_the_fucking_rows($q); # AWESOME!

    for ($i=0, $random_items = array(); $i<$n; ++$i){
        $random_items[] = reset(array_splice($rows, array_rand($rows), 1));
    }
?>

The downside here is that your overall result set is essentially always the same. We can do better, though.

Step 2: random ORDER BY criteria

Examine the following simple table from our fictional bike shop

mysql> show create table bicycle \G
*************************** 1. row ***************************
       Table: bicycle
Create Table: CREATE TABLE `bicycle` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `frame` tinyint(3) unsigned NOT NULL default '0',
  `in_stock` tinyint(3) unsigned NOT NULL default '0',
  `date_available` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY  (`id`),
  KEY `available_and_when` (`in_stock`,`date_available`)
) ENGINE=InnoDB AUTO_INCREMENT=510241 DEFAULT CHARSET=utf8

Assuming you want to show random bicycles that are in stock, the query from above changes to

$q = "SELECT * FROM bicycles WHERE in_stock = 1 LIMIT $limit";

However, notice the available_and_when key: it includes in_stock as the left most column, but also date_available. Since we’re already using the leftmost column in our query, we can sort using the second column while still using the index – FO FREE*!

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5;
+-----+-------+----------+----------------+
| id  | frame | in_stock | date_available |
+-----+-------+----------+----------------+
... SNIP ...
+-----+-------+----------+----------------+
5 rows in set (0.00 sec)

mysql> explain select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: bicycle
         type: ref
possible_keys: available_and_when
          key: available_and_when
      key_len: 1
          ref: const
         rows: 86900
        Extra: Using where
1 row in set (0.00 sec)

Why does this matter? It means that you can take advantage of this when grabbing your set of non-random data. You can randomly pick which way you’re going to sort the table, then grab the $n*$k rows. Our code snippet becomes

<?php
    $n = 4; # gimme this many
    $k = 5;
    $limit = $n*$k;
    $order_candidates = array('date_available', 'date_available DESC');
    $order_by = $order_candidates[array_rand($order_candidates)];
    $q = "SELECT * FROM bicycles WHERE in_stock = 1 $order_by LIMIT $limit";
    $rows = mysql_give_me_the_fucking_rows($q); # AWESOME!

    for ($i=0, $random_items = array(); $i<$n; ++$i){
        $random_items[] = reset(array_splice($rows, array_rand($rows), 1));
    }
?>

Depending on your query and the indexes you have available to you, you could have numerous sorting possibilities. The more different ways you can sort, the more random your data will be. If the randomness of this data is truly important, you can add indexes to accommodate more ways of ordering.

DO NOT BE A COCK! Run every single possibility through explain. If you fuck up and order by something stupid, your DBA/Ops will murder you, and your site will be balls slow.

Step 3: Random OFFSET

This could very well have been Step 2, but I like this part less, so it goes later. This part requires a more extensive use of common sense, thus leaves more room for stupid.

Now that you’ve got some different ways you could order the data, you could also broaden your range by specifying a random offset. There are two caveats:

  1. You don’t always know how many rows you have in total, and finding out could be expensive
  2. You don’t want too high of an offset, because that will be slow

Since #2 means you should never even THINK about #1, I guess there’s really only the one caveat.

What do I mean? Let’s look at some queries against my measly 500k rows table:

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 0;
... SNIP ...
5 rows in set (0.00 sec)

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 5000;
... SNIP ...
5 rows in set (0.07 sec)

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 50000;
... SNIP ...
5 rows in set (1.15 sec)

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 100000;
... SNIP ...
5 rows in set (1.97 sec)

mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 250000;
... SNIP ...
5 rows in set (4.89 sec)

The reason for the degrading performance is that MySQL has to generate the entire result set, then throw away everything below the offset. However, we can see that the performance is tolerable up to a point. We can certainly afford to pick from the first few thousand rows (or hundred, depending on your requirements).

Let’s implement the random offset:

<?php
    $n = 4; # gimme this many
    $k = 5;
    $limit = $n*$k;
    $order_candidates = array('date_available', 'date_available DESC');
    $order_rnd = array_rand($order_candidates); # separated for better cacheability
    $order_by = $order_candidates[$order_rnd];
    $offset_rnd = rand(0, 20); # separated for better cacheability
    $offset = $offset_rnd * 100;
    $q = "SELECT * FROM bicycles WHERE in_stock = 1 $order_by LIMIT $limit OFFSET $offset";
    $rows = mysql_give_me_the_fucking_rows($q); # AWESOME!

    for ($i=0, $random_items = array(); $i<$n; ++$i){
        $random_items[] = reset(array_splice($rows, array_rand($rows), 1));
    }
?>

Working around OFFSET performance degradation

If you just can’t sleep well, knowing that you are not pulling rows from the entire table, you should really seek professional help – something far beyond whatever I can provide you with. Seriously, who gives a shit? However, I do have something to tide you over. Among other places, I’ve seen this workaround suggested in this MySQL Performance Blog Post. If you have a sequential column you can use (i.e., that is included in an index you can use for your query), you could use a WHERE col BETWEEN x AND y clause to simulate random offsets. Caveat: data may not be evenly distributed, so you might hit a range that has no rows in it, or not enough rows for your random sample.

For this scenario, MySQL does do some very handy things. If your table looks like mine, i.e. has an index on the column we’re selecting by AND the column we want to use for our offset, you can get the MIN and MAX bounds FO FREE**:

mysql> select MAX(date_available),MIN(date_available) FROM bicycle WHERE in_stock=1;
+---------------------+---------------------+
| MAX(date_available) | MIN(date_available) |
+---------------------+---------------------+
|          1253385344 |          1249496986 | 
+---------------------+---------------------+
1 row in set (0.00 sec)

mysql> explain select MAX(date_available),MIN(date_available) FROM bicycle WHERE in_stock=1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: NULL
         type: NULL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
        Extra: Select tables optimized away <<< fucking sweet

Once you have the minimum and maximum timestamps, you can just pick a random range. As I said above, you could get yourself an empty result set, so be sure you only do this with uniformly distributed data and a wide enough range.

A final note: don’t forget to fucking cache

Although the entire purpose of this post is to show how to do these things in a cheap way, it’s still important to remember: MySQL time is much more expensive than memcache time. For something like this, you should definitely be caching. The strategy is very simple: generate your query params (limit, order direction, offset), then use those to build your key. Check that you don’t already have this result set in cache; if you don’t, query the database and store it in the cache using that key. The next time you get the same random criteria, you’ll be able to get the rows – FO FREE***. For our last code example, the cache key could look like this:

$key = "BicycleRandom_$limit_{$order_rnd}_{$offset_rnd}";

That’ll do it!

* By ‘FO FREE’ I mean free, aside from the cost of maintaining the index
** This time, I really did mean FREE
*** Or almost free, anyway

streamstats: easier log analysis

August 22nd, 2009

(Since I’m tired of WordPress’s wysiwig editor, I’m writing my posts in mostly markdown going forward. Eventually I’ll probably switch this blog over to another platform that supports markdown editing with less pain – likely Richard Crowley’s bashpress)

Background

After watching our Ops team try to figure out if a certain IP was malicious by piping access log output into all sorts of utilities, I decided to whip up something that would simplify that task. The first pass resulsted in a 40 line script that took me a bit longer than it should have – I had forgotten statistics completely and, to my embarrassment, didn’t remember the printf syntax well enough to format a histogram.

The code can be found on github.

Basic Functionality

Regardless, the initial version of the script is quite simple: it takes stdin and generates a histogram, as well as some statistics for values in the stream. Effectively, it assumes that by the time you’re piping things into it, you’ve narrowed it down to just the piece of information you’re interested in (an IP address, an API key, etc)

Output might look like this (fake info):

awk '{print $1}' access.log | streamstats.py

192.168.10.100 |                                        1 192.168.10.101 |                                        1 192.168.10.102 ||||                                     4 192.168.10.103 ||||                                     4 192.168.10.104 |                                        1 192.168.10.105 |                                        1 192.168.10.106 |||||                                    5 192.168.10.107 ||||                                     4 192.168.10.108 |||||||||||                              11 * 192.168.10.109 ||||||                                   6 192.168.10.110 |||                                      3 192.168.10.111 |                                        1

Some Statsy Things count 12 outliers 1 maximum 11 minimum 1 stdev 2.84312035154 total 42 mean 3.5

As of this writing, I’ve added the -o option, which will limit the histogram output to show only outliers.

Moar! Time Awareness

The natural next step is to make streamstats time-aware. The ability to pipe an entire log through it and see statistics like requests per second per IP would greatly increase its usefulness. As it is now, a user has to first track down when a potential attack was happening, isolate just those log lines, then pipe them through. Making streamstats time-aware would allow that user to just specify a start time and an end time as parameters.

Implementation Possibilities

The question is, how to add the ability to recognize timestamps in logfiles without making things overly complicated?

Candidate #1: awk and friends

My initial thought was to allow the user to supply command sequences as arguments that would tell streamstats how to locate the timestamp and the acutal value of interest. The reasoning was to take advantage of the abundance of *nix utilities aleady available and the likely expert knowledge of said utilities possessed by the average expected user of streamstats.

Unfortunately (or, perhaps, fortunately), this was a non-starter. Shell-level string interpolation makes this awkward to the point of uselessness, if not impossible.

Candidate #2: Regular Expressions

Another tool that I would assume most sysadmins are familiar with, if nothing else by way of grep, is regular expressions. Those are a lot easier to implement – the script can just take regex patterns as arguments. My initial thoughts were to introduce the following parameters:

  • -t –time to let streamstats know that we’re interested in time, with an optional argument that would be the regex pattern used to find the time
  • -p –pattern to specify the pattern used to actually parse the value out of every line of input

However, there are some problems with this. It leaves the door open for the user to specify a time pattern and not a value pattern, resulting in garbage. I could force the user to add a -p pattern if -t is specified, but it just seems unnatural.

Instead, I think it might be easier to take advantage of named groups and add the following arguments:

  • -t –time to let streamstats know that we’re interested in time
  • -p –pattern to specify the regular expression that will parse out both the time and the value. In the absence of -t, the default pattern would be ^(?P<value>.+)\n$ (the whole line; in fact, regular expressions will just be bypassed). In the presence of -t, the default pattern would be ^\[(?P<time>.+)\]\s*(?P<value>.+)\n$.

The advantage of this approach is that the default pattern allows the user to avoid having to specify a pattern altogether – they can just pipe the input to streamstats using awk '{printf "[%s] %s\n", $1, $2}' or something similar. However, the possibility remains to simply come up with the regex that parses out what’s needed from each line of input, thus eliminating the need to use other tools to manipulate the stream prior to streamstats. A possible downstream feature would be letting users maintain a list of frequently used patterns (probably some place like ~/.streamstats), avoiding the time usually spent conjuring up the right sequence of awk and grep.

I have not yet started implementing the time features, so any thoughts or suggestions would be deeply appreciated.

Django: exposing settings in templates, the easy way

July 26th, 2009

NOTE: I have submitted a patch to have this functionality as part of Django core. I’m posting this code here in case the patch gets rejected, since similar motions have previously been rejected. The ticket is at http://code.djangoproject.com/ticket/3818, and I will update this post if the patch gets accepted.

I find it very handy to be able to expose parts of settings.py in templates. The Django docs show you how to do this using context processors and give the code for the MEDIA_URL context processor here. However, it seems silly to me to have to write custom processors all the time. Instead, I propose a single processor that exposes a list of settings (stored in settings.py itself) to the template layer.

The gist containing the code is here: http://gist.github.com/155894

Enjoy.

daemontools 0.76 on Ubuntu 9.04

July 5th, 2009

I find daemontools to be incredibly useful for quickly turning scripts into daemons. I also like to run the latest version of Ubuntu, which means there is no (well-maintained) package available. There’s a package in ‘universe’ but it puts things in strange places and generally inspires distrust (explained below). As installing daemontools from source on Ubuntu hasn’t been straight forward (I had trouble both on 8.10 and 9.04), I’m gonna post a quick howto which pulls together a few pieces of information scattered throughout the internets.

Crash Course

daemontools comes with a bunch of tools. The full docs are here, but I’ll cover the basics.

  • supervise is a daemon that takes a directory as an argument and tries to execute the file named ‘run’ in that directory, while keeping state in the ‘supervise’ subdirectory. Naturally, you want to give it a folder with an executable file named ‘run’ and a wirtable subdirectory called ‘supervise’ in it.
  • svstat, svok, svc, etc are used to manipulate individual supervise jobs and check their status. These aren’t covered here
  • svscan is a daemon that also takes a directory as an argument and proceeds to look for supervise-able directories within. Any time it finds one that doesn’t already have a corresponding supervise process, it fires one up. Thus if the supervise instance you count on to do some background processing crashes for a whatever reason, svscan will just start it back up
  • svscanboot is the script that should be run at boot-time to start svscan. It will direct svscan to look for jobs in the /service directory.

In an ideal world, you throw a bunch of shit into /service, svscan picks it up and fires up a supervise process for all the subdirectories. It runs as a daemon and restarts supervise whenever needed (svscan itself is basically bulletproof in my experience). When your machine is rebooted for whatever reason, svscanboot restarts svscan and everything resumes as before. The reason I choose not to use the Ubuntu package is because it does not create this ideal world for me.

Compiling

First of all, daemontools won’t compile as is. Follow along the instructions here until you get to the exciting part where you type ‘package/install’ and then get an error message that looks like this:

/usr/bin/ld: errno: TLS definition in /lib/libc.so.6 section .tbss
mismatches non-TLS reference in envdir.o
/lib/libc.so.6: could not read symbols: Bad value
collect2: ld returned 1 exit status
make: *** [envdir] Error 1

Never fret. I have found a couple of possible fixes, but the patch below appears to be the cleanest one. [1]

diff -ur daemontools-0.76.old/src/error.h daemontools-0.76/src/error.h
--- daemontools-0.76.old/src/error.h    2001-07-12 11:49:49.000000000 -0500
+++ daemontools-0.76/src/error.h    2003-01-09 21:52:01.000000000 -0600
@@ -3,7 +3,7 @@
#ifndef ERROR_H
#define ERROR_H

-extern int errno;
+#include <errno.h>

extern int error_intr;
extern int error_nomem;

Assuming you are still at the point in the installation instructions right before the ‘package/install’ command, copy the patch into /tmp/daemontools-errno.patch, then

$ cd src
$ patch < /tmp/daemontools-errno.patch
$ cd ..
Then resume the original daemontools instructions as prescribed.

Testing

The installation creates two folders: /command and /service. The former contains all the different daemontools commands, while the latter is the default folder which the svscan daemon scans for jobs. To test that our installation worked, let’s create a proper test job (you’ll probably want to be superuser for this, since /service is owned by root by default)

$ cd /service
$ mkdir tester
$ mkdir tester/supervise
$ touch tester/run
$ chmod +x tester/run
in tester/run, put any script. It can be empty, but if you really want to verify things are working, make it output something to a tmp file. Sample:

#!/bin/bash

echo "Foo!" >> /tmp/foo.log
sleep 5

Once everything is in place, as superuser run the following commands. If your output looks different, something’s broke.

$ /command/svscan /service &
[1] 9641
$ ps -ef | grep svscan
root      9641 17108  0 18:12 pts/1    00:00:00 /command/svscan /service
mikhailp  9646 17108  0 18:12 pts/1    00:00:00 grep svscan
$ ps -ef | grep supervise
root      9642  9641  0 18:12 pts/1    00:00:00 supervise tester
mikhailp  9660 17108  0 18:12 pts/1    00:00:00 grep supervise
If you made your tester script log something, check that the output is showing up where expected. If all is well, shut ‘er down and move on.
$ killall svscan
$ killall supervise

Starting at Bootup

Assuming you want to user supervise for some critical offline part of your application, you want it to start automatically whenever your server boots up. You don’t want to have to manually start up svscan every time.

Ubuntu uses upstart for task management instead of sysvinit. In order to let upstart start and stop the svscan daemon correctly, the following script needs to be copied into /etc/event.d/svscan [2]

# svscan - daemontools
#
# This service starts daemontools from the point the system is
# started until it is shut down again.

start on startup

start on runlevel 1
start on runlevel 2
start on runlevel 3
start on runlevel 4
start on runlevel 5
start on runlevel 6

stop on shutdown

exec /command/svscanboot

Once that’s in place, you can use start svscan and stop svscan. Try starting it and checking ps output as described in the Testing section. Finally, reboot the box (if possible) and repeat the same checks. At this point, svscan should start up with your system and spawn supervise processes on any job it finds in /service.

That’s all there is to it. Enjoy.

1: Patch originally found here, linked from here. 2: Corrected from this entry posted in 2006 which has some obsolete info

Google Data API With OAuth Using the GData Python Client

April 22nd, 2009

On one of my projects (the one that’s kept me too busy to blog), I had to work quite a bit with the google data API. I was using Django, so I was using the gdata python client.

The client is, for the most, part excellent. It abstracts away the parts of OAuth that you never ever want to have to know anything about or debug. All the signing, encoding, and decoding takes place inside the library – where it should happen.

The spec is somewhat young, and the library is even more so; there isn’t a whole lot out there covering the subject. The client comes with some sample scripts, but they don’t make it very clear how to set things up in a web app workflow. I’m going to try to fill the void somewhat with a sample Django app and a somewhat extensive writeup. You don’t have to know/understand Django to follow along, but it may help.

Disclaimer: The sample app is written in the simplest possible way, so as to focus on the code that actually has to do with OAuth and the GData API. When I say “sample,” I really fucking mean it. I pay no attention to things like security and good practices, because I don’t give a shit about those things, and neither should you. The purpose is to understand how Google’s python client works. If you copy and paste this code into your production app and push your key, secret, and RSA creds to a public github repository, that is your own stupid fault.

OAuth Crash Course

There are many better resources for learning about OAuth proper, so I’ll cover the basics. OAuth is a protocol for securing communication between an API (provider) and an app that wants to use said API (consumer). It’s fairly complicated, but due to Google having a pretty nice client wrapped around their API, most of that is abstracted away. What we need to know is the general OAuth workflow:

  1. User invokes a part of the consumer app that requires access to the provider’s API
  2. If the consumer already has an OAuth Access Token for that user, skip down to step 6
  3. If not, the consumer fetches an OAuth token key and token secret from the provider. The token secret is stored. The consumer then directs the user to an Authorization URL provided by the provider, with the token key as a query parameter, along with a “callback” URL pointing back to the consumer
  4. The user authorizes the consumer to use the provider’s data; the provider redirects the user back to the callback URL with the “Authorized Request Token” ‘oauth_token’ query variable set to its token key. The consumer can then request an Access Token from the provider by sending this token key back with the token secret that was saved in the previous step
  5. The provider returns the Access Token. The consumer can now use this Access Token to access the user’s data through the provider’s API until the user elects to revoke the Access Token
  6. The consumer does magic tricks with user’s data, including making it disappear

Simple, right? I’ve left out all the encryption details (those are still murky to me as well), but luckily we don’t have to worry about that too much. I’ll reference this list of steps as I go along – mostly to keep myself sane. Let’s call a spade a fucking spade – shit is confusing! More info on OAuth and Google’s implementation of it can be found here. If you want to be really hardcore, you can read the spec here.

Some key terms to remember along the way:

  • Provider – service with the user’s data (in this case, Google)
  • Consumer – the app that wants to access the user’s data (your app)
  • Request Token – refers to a token used by the consumer to request access from the provider/user. It can be authorized and unauthorized. The token is initially given by the provider to the consumer in an unauthorized state. The consumer has to then direct the user to the provider’s authorization page with the unauthorized token. If the user accepts, the provider will send the user back to the consumer with an authorized request token.
  • Access Token – refers to a token used by the consumer to actually authenticate with the provider. The consumer has to exchange an authorized request token for an access token before it can access data. The access token is what gives the consumer long-term access to the user’s data.

Getting Started

First and foremost, go here and sign up for an API key. Pretty straight forward. Store your App key and secret somewhere – I’m using the settings.py file in my sample app, though that may not be optimal security-wise. You’ll also see an rsa_key in my GDATA_CREDS dictionary. That is there because my example uses RSA_SHA1 encryption. You don’t have to use that, but I recommend it, and the example will be easier to follow. I’ve seen in some documentation that may or may not be outdated that the contacts API only accepts RSA encryption, though I haven’t tried HMAC myself. To setup your RSA encryption, follow the steps here. You should be able to accomplish this on most unix/linux boxes.

Now that you have your app key and secret, and you’ve set up RSA encryption, you should download and install the the gdata lib itself. Get it here.

Storing the Token

Before you start using tokens, you need some way to store them. A token consists of a token key and a token secret. You could create a database table to store those things individually. Me, I’m lazy – I just pickle my entire gdata token object. It lets me rebuild a functioning token object on the spot when I fetch it from the database. So for my case, I just have a text field for storing the token itself. See models.py in the sample app.

You could probably just as easily define a model with token and token secret fields and a method to return an initialized token object. Whatever your preference, define an appropriate model. You also need a way to tie the tokens to a user. My sample app does this using a foreign key to the django.contrib.auth.models.User model.

Creating a Token

Now we’re ready to actually start down the path towards getting the OAuth Access Token – the jackpot of API access. Seriously, you will feel a sense of palpable relief when you finally get one working. Like when passing a pineapple.

If you’re following along in the app, we’re going to skip over shit like user signup and auth (it’s only there so that people could set this up on their server and really see it work) and go straight for the money: getting the OAuth token. Look at /oauth/views.py->add_token. Approximate line numbers will be given in square brackets.

The first line of the view [~97] fetches the right scope for dealing with the contacts API. The scopes determine what part of the user’s data your token will have access to and are not part of the OAuth spec. They are stored under some pretty cryptic keys, so you may just have to look in the gdata source under gdata/service.py to find the right ones. I just want to use the contacts feed for the example, so I put ‘cp’ (what else would it be, right?)

Next, we initialize the gd_client variable, which is the object that will take care of all of our OAuth needs and communicate with TEH GOOGALZ [~99]. You will see that I tell it to use RSA encryption and pass it my app’s API credentials from settings.py. There are a few different subclasses to the base client that provide more specific methods for dealing with individual feeds (contacts, blogger, docs, etc). We’ll just use the base client for initializing our token.

Now we arrive at a conditional to check what part of the process we’re on [~107]. If we don’t have a ‘oauth_token’ get parameter, that means that we’re at step 3 in our process – we have to send the user to Google to hit the big Authorize button.

The first thing we do is initialize a request token (rt). We then store the ‘secret’ part of that token. I store it in a session for simplicity/readability, but it can be stored anywhere.

We then tell our client to use this request token and to generate an authorization URL for that token [~116]. That’s the URL we will send our user to in order for them to authorize us. Note that in the callback we just pass the current URL. When the user clicks ‘Allow’ on Google’s authorization page, they will be sent right back to this same view, but this time with the ‘oauth_token’ parameter, meaning they will hit the ‘else’ part of our conditional.

When we do get back to the page, we’ll need to reconstruct our token [~127]. Google has modified our token and marked it authorized, but we still need to put back our piece of the puzzle – the token secret. Since we stored it in a session variable, we retrieve it with ease. Since our token now has no context, we have to tell it what scopes we’re interested it in as well.

So what we have now is an authorized request token. We’re still not out of the woods. Now we need to upgrade our token to an actual access token. This is done via the UpgradeToOAuthAccessToken method of the gdata client [~135].

As the comments in my code indicate, this part is a tad shifty [~141], at least as of version 1.2.4 of the python client. Since upgrading the token modifies it once more, we need to store the end result of the upgrade. However, the upgrade method does not return that modified token, it just stores it in the client’s token store, hence the find_token call. See discussion here for more info. UPDATE 05/05/2009: the patch to fix this awkwardness just got committed to SVN, so hopefully the next version of the library will no longer require the extra step.

Regardless, once we’ve retrieved the authorized token (at), we can store it [~145]. As I had mentioned, I just pickle it into a TextField.

That’s it. Once you’ve saved that token, you can reconstruct it at any time to fetch any data that token is authorized for (as in, which scopes you specified when sending the token to Google).

To see it in action, just take a look at the views.py->home. First, we find the latest OAuth token in our database [~66]. If we don’t have one, we redirect the user to the add_token view to go through the OAuth process [~86]. If we do have one, we fetch some data.

We create and initialize a ContactsService client [~71-78] and then pass it our token [~79-80]. Then BAM, we fetch some contacts from our user’s contacts feed.

That’s all for now. Hopefully this overview is helpful in getting everybody and their mother developing apps using the Google APIs. I welcome corrections and additions, and will update this post if I get any important ones.

Bye Bye, Internets!

March 14th, 2009

I’m leaving for France tomorrow. I’m not taking my laptop, I won’t be using data on my phone (AT&T international data rates are something else), and I won’t be checking my email.

If you need to reach me, shoot me an email and I will get back to you promptly – on or after March 22nd :)

Drizzle: Time To Get Excited

March 3rd, 2009

Brian Aker was in San Francisco on Monday night to talk about to the MySQL and PHP Meetup groups about Drizzle, the fork of MySQL aimed at meeting the needs of large web applications.

I had heard various tidbits here and there about Drizzle, including at MySQL2008, and had a pretty vague impression of how it was actually going to be different. The general theme seemed to be simpler, smaller, and faster.

If everything Brian described is true, it’s those things plus FUCKING AWESOME. I will join my friends (here, for example) in foaming at the mouth about it. Below is a rough list of things Brian mentioned that get me all hot’n'bothered about Drizzle, in no particular order:

  • No Query Cache. The best part here is the reasoning for its absence: “If you’re relying on the query cache, you probably should have just used memcached to begin with.”
  • In-Query Sharding Info. I’m not exactly sure of the details of the implementation, but it sounds like Drizzle will make it possible to spray queries through a to the right shard automatically.
  • Pluggable Authentication. Finally. Authentication can also be completely turned off.
  • Serializable Query Plans. This feature will allow the parser to be bypassed entirely on most queries. You simply send the query to MySQL, get the execution plan back, cache that, and send the execution plan back the next time you need the same query. That is Fucking Bad Ass (TM).
  • Fewer Locks. Getting rid of a lot of the more advanced and rarely used features introduced in MySQL 5/5.1 (views, stored procedures, etc.), as well as some of the more basic stuff like authentication, has allowed Drizzle to lose 2/3 of the locks present in 5.0 (at least that’s what I think Brian said… sounds surreal…) which obviously opens the door for vast improvements on multicore architectures. It also sounded like Brian made the decision during his talk to discontinue MyISAM support in Drizzle to be able to get rid of another huge lock. I’m OK with it…
  • Discontinued support for antique hardware. Lots of code ripped out because it’s no longer needed.
  • Everything is pluggable, most things are optional. The Drizzle kernel is about 115kloc. Amazing.

Brian was very coy about benchmarks, leaving it to independent sources to run them, but it sounds like Drizzle will leave MySQL in the dust for most of the common applications seen in webapps. I can’t wait to try it and to use it on a few things.

APIMuni by Danny Roa – Bringing NextMuni To The Masses

February 28th, 2009

Danny Roa, whom I met at the last Django Meetup, has put out a quick API for accessing Nextbus data.

It’s hosted on the App Engine and can be found here.

His writeup is here.

He recycled the scraping code from yourmuni, props to him for giving props :) Of course, that just means that when Nextbus gets angry, they’re going to come after me first!

Developers don’t create API’s for nothing, so I am eagerly anticipating what Danny is going to use this API for.