In other news, this is post #100 on this blog.
In other news, this is post #100 on this blog.
I had a post brewing in my head for a couple of weeks that was going to be titled “Keeping tabs on your crontabs,” inspired by some recent margins-of-the-day stuff I’d been doing. However, Peter Z at the MySQL Performance Blog has beaten me to it. I still maintain my title idea was better
Check it out, it has some nice, easy-to-implement tips for making sure your crons don’t do stupid things and making your ops team happy (and we’re all about that at Flickr).
Just wanted to post a quick note about the state of streamstats, the little tool I’ve been working on for analyzing logs/data files. Things stalled a bit when I started trying to implement time awareness, as it turned out that Python’s time parsing capabilities are limited, to put it nicely. I even tried to use regex to find a matching pattern before parsing the date, but I was unable to parse common date formats found in the logs this tool is intended to parse (namely, apache logs; and no, changing the date format for all of Flickr’s hosts is not a fucking option, ok?) This was unacceptable.
I quickly recreated the basic functionality in PHP, using the famed strtotime function. However, then I looked at the getopt() implementation available stock with PHP and realized I was either going to have to package a third party option (the pickings there were also slim), write my own lib to do it, or write a whole shitload of custom code specifically for streamstats. However, the first option was not attractive due to the fact that I’d have to create a package for it for Yahoo!’s packaging system, and the other two are unattractive because… well, I’m trying to write a fucking stats analysis function, not options handling code.
That means streamstats is being rewritten, for the 3rd time. in Perl. I’ll be using Getopt::Long and Date::Manip to keep the auxiliary logic out of the script. Luckily, the basic functionality won’t take long to recreate, and the features I’ve been trying to add shouldn’t be too bad either. Plus, I get to finally re-learn Perl.
The next couple of months promise to be exciting in terms of shipping things, both at Flickr and outside. I’m looking forward to posting actual code for streamstats.pl.
Crowley and Malone have been going at it over duck typing vs static typing. I’ve decided to invite myself into the discussion. Crowley’s initial post is here. Malone’s response is here (read the comments as well) and Crowley’s follow up is here.
Richard’s argument is that duck typing does not alert the developer that the wrong argument type had been supplied to a method until it is too late, if at all. His two examples are 1. passing a CreditCard to a template instead of a Kitten, resulting in the CreditCard’s number being displayed where a Kitten’s name should have shown up and 2. more interestingly, a method which does something scary up front (presumably, by ‘poke_a_tiger_with_a_stick()’ Richard means a persistent data write or a financial transaction), then iterates over one of its arguments and fails if the argument passed is not an iterable. Richard would rather know that the argument passed is not of the right type up front, so that the method would fail immediately. He suggests “turkey typing,” a C++ STL-like syntax for telling the compiler what the argument to a function should be. In his example, he uses an iterable containing only certain types of objects.
Mike’s argument is that duck typing provides much more flexibility than static typing – if you need an object to only fullfill one part of a certain interface, you can simply define that part of the interface, plug the object into a method that expects that specific part of the interface, and be on your merry way. Further, Malone wants exactly what Richard doesn’t – to be able to pile all sorts of different types of objects into a collection. Another important point he brings up in the comments is that duck typing allows many different developers to come up with many different solutions to a problem defined by an interface without having to share any common ground at all. As long as the same interfaces are implemented, the objects can be swapped in and out. Free-flyin polymorphism.
Richard’s proposal is to specify the type of argument up front – use some syntax to say that an argument should only be of a certain type or an iterable with only certain types in it, and everything is peachy: if you pass the wrong type, the compiler yells at you before you get to any logic.
However, this lacks flexibility in the same way that static typing does. It effectively encourages the creation of a whitelist for what sorts of things should be allowed into the method. Interfaces (the language constructs) can mitigate most of this, but only if the developer of the method uses an interface – extra boilerplate code.
Let’s take Flickr as an example. Initially, users could only upload photos. Using turkey typing, we’d have functions that would expect an object of type Photo. Then Flickr added video. We would have to go through all the functions that deal with generic actions – uploading, describing, deleting, etc – and edit the signature to also include Video objects. Of course, we could create a superclass, but then all of a sudden this doesn’t seem very different from the statically typed world Malone describes – lots of useless code just to be able to do simple things.
Both sides are interested in the same thing – an object’s capabilities. Malone is fine either A. assuming that people know what they’re doing or B. checking the capabilities manually. Crowley wants that check to be contained in the signature. The problem is, he wants to discover the capabilities by looking at a name. It’s equivalent to using the UserAgent string of a browser to try to figure out what code it can and can’t handle. It just doesn’t go well. If Richard’s syntax is to accomplish what it’s really after, it’ll have to become so complicated, that it’s basically not worth the trouble.
Let’s turn Richard’s example into something more realistic for a second. Say you have a function called purchase_items(), which takes as arguments a user object an array of items, puts those items into a shipping queue, and charges the user’s credit card. poke_a_tiger_with_a_stick() in this case is charging the credit card, and the loop that follows is the insertion of items into a shipping queue.
Richard is right to say that if you charge the credit card, and then the rest of the function fails because some idiot dev you hired straight out of college passed something stupid instead of the array of items (maybe a single item), your customer will be pretty pissed.
The problem with this thought experiment is that it uses shitty coding as its premise. If you charge the credit card before you’re sure you’ve verified your data and successfully recorded the order, you deserve that angry support phonecall. You should always save the critical transactions for for last, and I know that Richard knows this. You’ll never see him poking a tiger with a stick as the first order of business. Simply restructuring the method would surface the problem with the argument immediately.
I think Malone’s assessment is correct: static typing and permutations thereof ultimately aim at saving people from doing stupid things. However, there is plenty of software out there written in statically typed languages that does stupid things. You can’t help stupid. You can, however, write your code in a way that minimizes its impact, but you don’t need typing for that, just common sense.
Meanwhile, loosely typed languages allow allow for great flexibility and efficiency.
I think it’s worth noting that Crowley spends most of his time writing C and C++, whereas Malone and I work mostly in Python and PHP, respectively. I’m pretty sure that all of our preferences just line up with what we’re comfortable with.
This isn’t a rocket science post. Your world view will not be changed by it. It’s just a trick I came up with while pulling random data out of a database that gives a pretty good mixture of results without straining the server.
If your immediate response to seeing this post is “DUH, use ORDER BY RND(),” you are fired.
If you only need a few items (N), the easiest thing to do is to just grab N*k rows and pick randomly out of that set. In pseudo-PHP, it looks about like this:
<?php
$n = 4; # gimme this many
$k = 5;
$limit = $n*$k;
$q = "SELECT * FROM bicycles LIMIT $limit";
$rows = mysql_give_me_the_fucking_rows($q); # AWESOME!
for ($i=0, $random_items = array(); $i<$n; ++$i){
$random_items[] = reset(array_splice($rows, array_rand($rows), 1));
}
?>
The downside here is that your overall result set is essentially always the same. We can do better, though.
Examine the following simple table from our fictional bike shop
mysql> show create table bicycle \G
*************************** 1. row ***************************
Table: bicycle
Create Table: CREATE TABLE `bicycle` (
`id` bigint(20) unsigned NOT NULL auto_increment,
`frame` tinyint(3) unsigned NOT NULL default '0',
`in_stock` tinyint(3) unsigned NOT NULL default '0',
`date_available` int(10) unsigned NOT NULL default '0',
PRIMARY KEY (`id`),
KEY `available_and_when` (`in_stock`,`date_available`)
) ENGINE=InnoDB AUTO_INCREMENT=510241 DEFAULT CHARSET=utf8
Assuming you want to show random bicycles that are in stock, the query from above changes to
$q = "SELECT * FROM bicycles WHERE in_stock = 1 LIMIT $limit";
However, notice the available_and_when key: it includes in_stock as the left most column, but also date_available. Since we’re already using the leftmost column in our query, we can sort using the second column while still using the index – FO FREE*!
mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5;
+-----+-------+----------+----------------+
| id | frame | in_stock | date_available |
+-----+-------+----------+----------------+
... SNIP ...
+-----+-------+----------+----------------+
5 rows in set (0.00 sec)
mysql> explain select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: bicycle
type: ref
possible_keys: available_and_when
key: available_and_when
key_len: 1
ref: const
rows: 86900
Extra: Using where
1 row in set (0.00 sec)
Why does this matter? It means that you can take advantage of this when grabbing your set of non-random data. You can randomly pick which way you’re going to sort the table, then grab the $n*$k rows. Our code snippet becomes
<?php
$n = 4; # gimme this many
$k = 5;
$limit = $n*$k;
$order_candidates = array('date_available', 'date_available DESC');
$order_by = $order_candidates[array_rand($order_candidates)];
$q = "SELECT * FROM bicycles WHERE in_stock = 1 $order_by LIMIT $limit";
$rows = mysql_give_me_the_fucking_rows($q); # AWESOME!
for ($i=0, $random_items = array(); $i<$n; ++$i){
$random_items[] = reset(array_splice($rows, array_rand($rows), 1));
}
?>
Depending on your query and the indexes you have available to you, you could have numerous sorting possibilities. The more different ways you can sort, the more random your data will be. If the randomness of this data is truly important, you can add indexes to accommodate more ways of ordering.
DO NOT BE A COCK! Run every single possibility through explain. If you fuck up and order by something stupid, your DBA/Ops will murder you, and your site will be balls slow.
This could very well have been Step 2, but I like this part less, so it goes later. This part requires a more extensive use of common sense, thus leaves more room for stupid.
Now that you’ve got some different ways you could order the data, you could also broaden your range by specifying a random offset. There are two caveats:
Since #2 means you should never even THINK about #1, I guess there’s really only the one caveat.
What do I mean? Let’s look at some queries against my measly 500k rows table:
mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 0;
... SNIP ...
5 rows in set (0.00 sec)
mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 5000;
... SNIP ...
5 rows in set (0.07 sec)
mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 50000;
... SNIP ...
5 rows in set (1.15 sec)
mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 100000;
... SNIP ...
5 rows in set (1.97 sec)
mysql> select * from bicycle WHERE in_stock = 1 ORDER BY date_available LIMIT 5 OFFSET 250000;
... SNIP ...
5 rows in set (4.89 sec)
The reason for the degrading performance is that MySQL has to generate the entire result set, then throw away everything below the offset. However, we can see that the performance is tolerable up to a point. We can certainly afford to pick from the first few thousand rows (or hundred, depending on your requirements).
Let’s implement the random offset:
<?php
$n = 4; # gimme this many
$k = 5;
$limit = $n*$k;
$order_candidates = array('date_available', 'date_available DESC');
$order_rnd = array_rand($order_candidates); # separated for better cacheability
$order_by = $order_candidates[$order_rnd];
$offset_rnd = rand(0, 20); # separated for better cacheability
$offset = $offset_rnd * 100;
$q = "SELECT * FROM bicycles WHERE in_stock = 1 $order_by LIMIT $limit OFFSET $offset";
$rows = mysql_give_me_the_fucking_rows($q); # AWESOME!
for ($i=0, $random_items = array(); $i<$n; ++$i){
$random_items[] = reset(array_splice($rows, array_rand($rows), 1));
}
?>
If you just can’t sleep well, knowing that you are not pulling rows from the entire table, you should really seek professional help – something far beyond whatever I can provide you with. Seriously, who gives a shit? However, I do have something to tide you over. Among other places, I’ve seen this workaround suggested in this MySQL Performance Blog Post. If you have a sequential column you can use (i.e., that is included in an index you can use for your query), you could use a WHERE col BETWEEN x AND y clause to simulate random offsets. Caveat: data may not be evenly distributed, so you might hit a range that has no rows in it, or not enough rows for your random sample.
For this scenario, MySQL does do some very handy things. If your table looks like mine, i.e. has an index on the column we’re selecting by AND the column we want to use for our offset, you can get the MIN and MAX bounds FO FREE**:
mysql> select MAX(date_available),MIN(date_available) FROM bicycle WHERE in_stock=1;
+---------------------+---------------------+
| MAX(date_available) | MIN(date_available) |
+---------------------+---------------------+
| 1253385344 | 1249496986 |
+---------------------+---------------------+
1 row in set (0.00 sec)
mysql> explain select MAX(date_available),MIN(date_available) FROM bicycle WHERE in_stock=1\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: NULL
type: NULL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: NULL
Extra: Select tables optimized away <<< fucking sweet
Once you have the minimum and maximum timestamps, you can just pick a random range. As I said above, you could get yourself an empty result set, so be sure you only do this with uniformly distributed data and a wide enough range.
Although the entire purpose of this post is to show how to do these things in a cheap way, it’s still important to remember: MySQL time is much more expensive than memcache time. For something like this, you should definitely be caching. The strategy is very simple: generate your query params (limit, order direction, offset), then use those to build your key. Check that you don’t already have this result set in cache; if you don’t, query the database and store it in the cache using that key. The next time you get the same random criteria, you’ll be able to get the rows – FO FREE***. For our last code example, the cache key could look like this:
$key = "BicycleRandom_$limit_{$order_rnd}_{$offset_rnd}";
That’ll do it!
* By ‘FO FREE’ I mean free, aside from the cost of maintaining the index
** This time, I really did mean FREE
*** Or almost free, anyway
(Since I’m tired of WordPress’s wysiwig editor, I’m writing my posts in mostly markdown going forward. Eventually I’ll probably switch this blog over to another platform that supports markdown editing with less pain – likely Richard Crowley’s bashpress)
After watching our Ops team try to figure out if a certain IP was malicious by piping access log output into all sorts of utilities, I decided to whip up something that would simplify that task. The first pass resulsted in a 40 line script that took me a bit longer than it should have – I had forgotten statistics completely and, to my embarrassment, didn’t remember the printf syntax well enough to format a histogram.
The code can be found on github.
Regardless, the initial version of the script is quite simple: it takes stdin and generates a histogram, as well as some statistics for values in the stream. Effectively, it assumes that by the time you’re piping things into it, you’ve narrowed it down to just the piece of information you’re interested in (an IP address, an API key, etc)
Output might look like this (fake info):
awk '{print $1}' access.log | streamstats.py
192.168.10.100 | 1
192.168.10.101 | 1
192.168.10.102 |||| 4
192.168.10.103 |||| 4
192.168.10.104 | 1
192.168.10.105 | 1
192.168.10.106 ||||| 5
192.168.10.107 |||| 4
192.168.10.108 ||||||||||| 11 *
192.168.10.109 |||||| 6
192.168.10.110 ||| 3
192.168.10.111 | 1
Some Statsy Things
count 12
outliers 1
maximum 11
minimum 1
stdev 2.84312035154
total 42
mean 3.5
As of this writing, I’ve added the -o option, which will limit the histogram output to show only outliers.
The natural next step is to make streamstats time-aware. The ability to pipe an entire log through it and see statistics like requests per second per IP would greatly increase its usefulness. As it is now, a user has to first track down when a potential attack was happening, isolate just those log lines, then pipe them through. Making streamstats time-aware would allow that user to just specify a start time and an end time as parameters.
The question is, how to add the ability to recognize timestamps in logfiles without making things overly complicated?
My initial thought was to allow the user to supply command sequences as arguments that would tell streamstats how to locate the timestamp and the acutal value of interest. The reasoning was to take advantage of the abundance of *nix utilities aleady available and the likely expert knowledge of said utilities possessed by the average expected user of streamstats.
Unfortunately (or, perhaps, fortunately), this was a non-starter. Shell-level string interpolation makes this awkward to the point of uselessness, if not impossible.
Another tool that I would assume most sysadmins are familiar with, if nothing else by way of grep, is regular expressions. Those are a lot easier to implement – the script can just take regex patterns as arguments. My initial thoughts were to introduce the following parameters:
However, there are some problems with this. It leaves the door open for the user to specify a time pattern and not a value pattern, resulting in garbage. I could force the user to add a -p pattern if -t is specified, but it just seems unnatural.
Instead, I think it might be easier to take advantage of named groups and add the following arguments:
^(?P<value>.+)\n$ (the whole line; in fact, regular expressions will just be bypassed). In the presence of -t, the default pattern would be ^\[(?P<time>.+)\]\s*(?P<value>.+)\n$.The advantage of this approach is that the default pattern allows the user to avoid having to specify a pattern altogether – they can just pipe the input to streamstats using awk '{printf "[%s] %s\n", $1, $2}' or something similar. However, the possibility remains to simply come up with the regex that parses out what’s needed from each line of input, thus eliminating the need to use other tools to manipulate the stream prior to streamstats. A possible downstream feature would be letting users maintain a list of frequently used patterns (probably some place like ~/.streamstats), avoiding the time usually spent conjuring up the right sequence of awk and grep.
I have not yet started implementing the time features, so any thoughts or suggestions would be deeply appreciated.
NOTE: I have submitted a patch to have this functionality as part of Django core. I’m posting this code here in case the patch gets rejected, since similar motions have previously been rejected. The ticket is at http://code.djangoproject.com/ticket/3818, and I will update this post if the patch gets accepted.
I find it very handy to be able to expose parts of settings.py in templates. The Django docs show you how to do this using context processors and give the code for the MEDIA_URL context processor here. However, it seems silly to me to have to write custom processors all the time. Instead, I propose a single processor that exposes a list of settings (stored in settings.py itself) to the template layer.
The gist containing the code is here: http://gist.github.com/155894
Enjoy.
I find daemontools to be incredibly useful for quickly turning scripts into daemons. I also like to run the latest version of Ubuntu, which means there is no (well-maintained) package available. There’s a package in ‘universe’ but it puts things in strange places and generally inspires distrust (explained below). As installing daemontools from source on Ubuntu hasn’t been straight forward (I had trouble both on 8.10 and 9.04), I’m gonna post a quick howto which pulls together a few pieces of information scattered throughout the internets.
daemontools comes with a bunch of tools. The full docs are here, but I’ll cover the basics.
In an ideal world, you throw a bunch of shit into /service, svscan picks it up and fires up a supervise process for all the subdirectories. It runs as a daemon and restarts supervise whenever needed (svscan itself is basically bulletproof in my experience). When your machine is rebooted for whatever reason, svscanboot restarts svscan and everything resumes as before. The reason I choose not to use the Ubuntu package is because it does not create this ideal world for me.
First of all, daemontools won’t compile as is. Follow along the instructions here until you get to the exciting part where you type ‘package/install’ and then get an error message that looks like this:
/usr/bin/ld: errno: TLS definition in /lib/libc.so.6 section .tbss mismatches non-TLS reference in envdir.o /lib/libc.so.6: could not read symbols: Bad value collect2: ld returned 1 exit status make: *** [envdir] Error 1
Never fret. I have found a couple of possible fixes, but the patch below appears to be the cleanest one. [1]
diff -ur daemontools-0.76.old/src/error.h daemontools-0.76/src/error.h --- daemontools-0.76.old/src/error.h 2001-07-12 11:49:49.000000000 -0500 +++ daemontools-0.76/src/error.h 2003-01-09 21:52:01.000000000 -0600 @@ -3,7 +3,7 @@ #ifndef ERROR_H #define ERROR_H -extern int errno; +#include <errno.h> extern int error_intr; extern int error_nomem;
Assuming you are still at the point in the installation instructions right before the ‘package/install’ command, copy the patch into /tmp/daemontools-errno.patch, then
$ cd src $ patch < /tmp/daemontools-errno.patch $ cd ..Then resume the original daemontools instructions as prescribed.
The installation creates two folders: /command and /service. The former contains all the different daemontools commands, while the latter is the default folder which the svscan daemon scans for jobs. To test that our installation worked, let’s create a proper test job (you’ll probably want to be superuser for this, since /service is owned by root by default)
$ cd /service $ mkdir tester $ mkdir tester/supervise $ touch tester/run $ chmod +x tester/runin tester/run, put any script. It can be empty, but if you really want to verify things are working, make it output something to a tmp file. Sample:
#!/bin/bash echo "Foo!" >> /tmp/foo.log sleep 5
Once everything is in place, as superuser run the following commands. If your output looks different, something’s broke.
$ /command/svscan /service & [1] 9641 $ ps -ef | grep svscan root 9641 17108 0 18:12 pts/1 00:00:00 /command/svscan /service mikhailp 9646 17108 0 18:12 pts/1 00:00:00 grep svscan $ ps -ef | grep supervise root 9642 9641 0 18:12 pts/1 00:00:00 supervise tester mikhailp 9660 17108 0 18:12 pts/1 00:00:00 grep superviseIf you made your tester script log something, check that the output is showing up where expected. If all is well, shut ‘er down and move on.
$ killall svscan $ killall supervise
Assuming you want to user supervise for some critical offline part of your application, you want it to start automatically whenever your server boots up. You don’t want to have to manually start up svscan every time.
Ubuntu uses upstart for task management instead of sysvinit. In order to let upstart start and stop the svscan daemon correctly, the following script needs to be copied into /etc/event.d/svscan [2]
# svscan - daemontools # # This service starts daemontools from the point the system is # started until it is shut down again. start on startup start on runlevel 1 start on runlevel 2 start on runlevel 3 start on runlevel 4 start on runlevel 5 start on runlevel 6 stop on shutdown exec /command/svscanboot
Once that’s in place, you can use start svscan and stop svscan. Try starting it and checking ps output as described in the Testing section. Finally, reboot the box (if possible) and repeat the same checks. At this point, svscan should start up with your system and spawn supervise processes on any job it finds in /service.
That’s all there is to it. Enjoy.
1: Patch originally found here, linked from here. 2: Corrected from this entry posted in 2006 which has some obsolete info
On one of my projects (the one that’s kept me too busy to blog), I had to work quite a bit with the google data API. I was using Django, so I was using the gdata python client.
The client is, for the most, part excellent. It abstracts away the parts of OAuth that you never ever want to have to know anything about or debug. All the signing, encoding, and decoding takes place inside the library – where it should happen.
The spec is somewhat young, and the library is even more so; there isn’t a whole lot out there covering the subject. The client comes with some sample scripts, but they don’t make it very clear how to set things up in a web app workflow. I’m going to try to fill the void somewhat with a sample Django app and a somewhat extensive writeup. You don’t have to know/understand Django to follow along, but it may help.
Disclaimer: The sample app is written in the simplest possible way, so as to focus on the code that actually has to do with OAuth and the GData API. When I say “sample,” I really fucking mean it. I pay no attention to things like security and good practices, because I don’t give a shit about those things, and neither should you. The purpose is to understand how Google’s python client works. If you copy and paste this code into your production app and push your key, secret, and RSA creds to a public github repository, that is your own stupid fault.
There are many better resources for learning about OAuth proper, so I’ll cover the basics. OAuth is a protocol for securing communication between an API (provider) and an app that wants to use said API (consumer). It’s fairly complicated, but due to Google having a pretty nice client wrapped around their API, most of that is abstracted away. What we need to know is the general OAuth workflow:
Simple, right? I’ve left out all the encryption details (those are still murky to me as well), but luckily we don’t have to worry about that too much. I’ll reference this list of steps as I go along – mostly to keep myself sane. Let’s call a spade a fucking spade – shit is confusing! More info on OAuth and Google’s implementation of it can be found here. If you want to be really hardcore, you can read the spec here.
Some key terms to remember along the way:
First and foremost, go here and sign up for an API key. Pretty straight forward. Store your App key and secret somewhere – I’m using the settings.py file in my sample app, though that may not be optimal security-wise. You’ll also see an rsa_key in my GDATA_CREDS dictionary. That is there because my example uses RSA_SHA1 encryption. You don’t have to use that, but I recommend it, and the example will be easier to follow. I’ve seen in some documentation that may or may not be outdated that the contacts API only accepts RSA encryption, though I haven’t tried HMAC myself. To setup your RSA encryption, follow the steps here. You should be able to accomplish this on most unix/linux boxes.
Now that you have your app key and secret, and you’ve set up RSA encryption, you should download and install the the gdata lib itself. Get it here.
Before you start using tokens, you need some way to store them. A token consists of a token key and a token secret. You could create a database table to store those things individually. Me, I’m lazy – I just pickle my entire gdata token object. It lets me rebuild a functioning token object on the spot when I fetch it from the database. So for my case, I just have a text field for storing the token itself. See models.py in the sample app.
You could probably just as easily define a model with token and token secret fields and a method to return an initialized token object. Whatever your preference, define an appropriate model. You also need a way to tie the tokens to a user. My sample app does this using a foreign key to the django.contrib.auth.models.User model.
Now we’re ready to actually start down the path towards getting the OAuth Access Token – the jackpot of API access. Seriously, you will feel a sense of palpable relief when you finally get one working. Like when passing a pineapple.
If you’re following along in the app, we’re going to skip over shit like user signup and auth (it’s only there so that people could set this up on their server and really see it work) and go straight for the money: getting the OAuth token. Look at /oauth/views.py->add_token. Approximate line numbers will be given in square brackets.
The first line of the view [~97] fetches the right scope for dealing with the contacts API. The scopes determine what part of the user’s data your token will have access to and are not part of the OAuth spec. They are stored under some pretty cryptic keys, so you may just have to look in the gdata source under gdata/service.py to find the right ones. I just want to use the contacts feed for the example, so I put ‘cp’ (what else would it be, right?)
Next, we initialize the gd_client variable, which is the object that will take care of all of our OAuth needs and communicate with TEH GOOGALZ [~99]. You will see that I tell it to use RSA encryption and pass it my app’s API credentials from settings.py. There are a few different subclasses to the base client that provide more specific methods for dealing with individual feeds (contacts, blogger, docs, etc). We’ll just use the base client for initializing our token.
Now we arrive at a conditional to check what part of the process we’re on [~107]. If we don’t have a ‘oauth_token’ get parameter, that means that we’re at step 3 in our process – we have to send the user to Google to hit the big Authorize button.
The first thing we do is initialize a request token (rt). We then store the ‘secret’ part of that token. I store it in a session for simplicity/readability, but it can be stored anywhere.
We then tell our client to use this request token and to generate an authorization URL for that token [~116]. That’s the URL we will send our user to in order for them to authorize us. Note that in the callback we just pass the current URL. When the user clicks ‘Allow’ on Google’s authorization page, they will be sent right back to this same view, but this time with the ‘oauth_token’ parameter, meaning they will hit the ‘else’ part of our conditional.
When we do get back to the page, we’ll need to reconstruct our token [~127]. Google has modified our token and marked it authorized, but we still need to put back our piece of the puzzle – the token secret. Since we stored it in a session variable, we retrieve it with ease. Since our token now has no context, we have to tell it what scopes we’re interested it in as well.
So what we have now is an authorized request token. We’re still not out of the woods. Now we need to upgrade our token to an actual access token. This is done via the UpgradeToOAuthAccessToken method of the gdata client [~135].
As the comments in my code indicate, this part is a tad shifty [~141], at least as of version 1.2.4 of the python client. Since upgrading the token modifies it once more, we need to store the end result of the upgrade. However, the upgrade method does not return that modified token, it just stores it in the client’s token store, hence the find_token call. See discussion here for more info. UPDATE 05/05/2009: the patch to fix this awkwardness just got committed to SVN, so hopefully the next version of the library will no longer require the extra step.
Regardless, once we’ve retrieved the authorized token (at), we can store it [~145]. As I had mentioned, I just pickle it into a TextField.
That’s it. Once you’ve saved that token, you can reconstruct it at any time to fetch any data that token is authorized for (as in, which scopes you specified when sending the token to Google).
To see it in action, just take a look at the views.py->home. First, we find the latest OAuth token in our database [~66]. If we don’t have one, we redirect the user to the add_token view to go through the OAuth process [~86]. If we do have one, we fetch some data.
We create and initialize a ContactsService client [~71-78] and then pass it our token [~79-80]. Then BAM, we fetch some contacts from our user’s contacts feed.
That’s all for now. Hopefully this overview is helpful in getting everybody and their mother developing apps using the Google APIs. I welcome corrections and additions, and will update this post if I get any important ones.
I’m leaving for France tomorrow. I’m not taking my laptop, I won’t be using data on my phone (AT&T international data rates are something else), and I won’t be checking my email.
If you need to reach me, shoot me an email and I will get back to you promptly – on or after March 22nd