Archive for the ‘python’ Category

streamstats: easier log analysis

Saturday, August 22nd, 2009

(Since I’m tired of WordPress’s WYSIWYG editor, I’m writing my posts mostly in Markdown going forward. Eventually I’ll probably switch this blog over to another platform that supports Markdown editing with less pain – likely Richard Crowley’s bashpress.)

Background

After watching our Ops team try to figure out whether a certain IP was malicious by piping access log output into all sorts of utilities, I decided to whip up something that would simplify that task. The first pass resulted in a 40-line script that took me a bit longer than it should have – I had forgotten my statistics completely and, to my embarrassment, didn’t remember printf syntax well enough to format a histogram.

The code can be found on github.

Basic Functionality

Regardless, the initial version of the script is quite simple: it takes stdin and generates a histogram, as well as some statistics, for values in the stream. Effectively, it assumes that by the time you’re piping things into it, you’ve narrowed the input down to just the piece of information you’re interested in (an IP address, an API key, etc.).

Output might look like this (fake info):

awk '{print $1}' access.log | streamstats.py

192.168.10.100 |                                        1
192.168.10.101 |                                        1
192.168.10.102 ||||                                     4
192.168.10.103 ||||                                     4
192.168.10.104 |                                        1
192.168.10.105 |                                        1
192.168.10.106 |||||                                    5
192.168.10.107 ||||                                     4
192.168.10.108 |||||||||||                              11 *
192.168.10.109 ||||||                                   6
192.168.10.110 |||                                      3
192.168.10.111 |                                        1

Some Statsy Things

count     12
outliers  1
maximum   11
minimum   1
stdev     2.84312035154
total     42
mean      3.5

As of this writing, I’ve added the -o option, which will limit the histogram output to show only outliers.
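
The core of that first pass can be sketched in a few lines. This is a simplification for illustration, not the actual streamstats source – the function name and return shape are made up here:

```python
import math
from collections import Counter

def stream_stats(values, width=40):
    """Count occurrences, render a printf-style histogram, and
    compute some basic statistics over the counts."""
    counts = Counter(v.strip() for v in values)
    lines = []
    for key in sorted(counts):
        n = counts[key]
        # %-*s left-pads the bar column to a fixed width so counts align
        lines.append('%s %-*s %d' % (key, width, '|' * n, n))
    nums = list(counts.values())
    mean = sum(nums) / float(len(nums))
    stdev = math.sqrt(sum((x - mean) ** 2 for x in nums) / len(nums))
    return lines, mean, stdev
```

Pipe-friendly in spirit: feed it `sys.stdin` and print the lines, and you get roughly the output shown above.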

Moar! Time Awareness

The natural next step is to make streamstats time-aware. The ability to pipe an entire log through it and see statistics like requests per second per IP would greatly increase its usefulness. As it is now, a user has to first track down when a potential attack was happening, isolate just those log lines, then pipe them through. Making streamstats time-aware would allow that user to just specify a start time and an end time as parameters.

Implementation Possibilities

The question is, how to add the ability to recognize timestamps in logfiles without making things overly complicated?

Candidate #1: awk and friends

My initial thought was to allow the user to supply command sequences as arguments that would tell streamstats how to locate the timestamp and the actual value of interest. The reasoning was to take advantage of the abundance of *nix utilities already available and the likely expert knowledge of said utilities possessed by the average expected user of streamstats.

Unfortunately (or, perhaps, fortunately), this was a non-starter. Shell-level string interpolation makes this awkward to the point of uselessness, if not impossible.

Candidate #2: Regular Expressions

Another tool that I would assume most sysadmins are familiar with, if nothing else by way of grep, is regular expressions. Those are a lot easier to implement – the script can just take regex patterns as arguments. My initial thoughts were to introduce the following parameters:

  • -t/--time to let streamstats know that we’re interested in time, with an optional argument that would be the regex pattern used to find the time
  • -p/--pattern to specify the pattern used to actually parse the value out of every line of input

However, there are some problems with this. It leaves the door open for the user to specify a time pattern and not a value pattern, resulting in garbage. I could force the user to add a -p pattern if -t is specified, but it just seems unnatural.

Instead, I think it might be easier to take advantage of named groups and add the following arguments:

  • -t/--time to let streamstats know that we’re interested in time
  • -p/--pattern to specify the regular expression that will parse out both the time and the value. In the absence of -t, the default pattern would be ^(?P<value>.+)\n$ (the whole line; in fact, regular expressions will just be bypassed). In the presence of -t, the default pattern would be ^\[(?P<time>.+)\]\s*(?P<value>.+)\n$.

The advantage of this approach is that the default pattern allows the user to avoid having to specify a pattern altogether – they can just pipe the input to streamstats using awk '{printf "[%s] %s\n", $1, $2}' or something similar. However, the possibility remains to simply come up with the regex that parses out what’s needed from each line of input, thus eliminating the need to use other tools to manipulate the stream prior to streamstats. A possible downstream feature would be letting users maintain a list of frequently used patterns (probably some place like ~/.streamstats), avoiding the time usually spent conjuring up the right sequence of awk and grep.
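
Here’s roughly how those named-group defaults would work in practice – a sketch, not the implemented feature (I strip the trailing newline instead of matching it, which is a small simplification of the patterns above):

```python
import re

# assumed defaults mirroring the proposal above
PLAIN = re.compile(r'^(?P<value>.+)$')
TIMED = re.compile(r'^\[(?P<time>.+)\]\s*(?P<value>.+)$')

def parse_line(line, time_aware=False):
    """Return (time, value) for a log line; time is None without -t."""
    pattern = TIMED if time_aware else PLAIN
    m = pattern.match(line.rstrip('\n'))
    if m is None:
        return None
    groups = m.groupdict()
    # groupdict() only contains 'time' when the pattern defines it
    return groups.get('time'), groups['value']
```

The named groups mean the user-supplied pattern can put `time` and `value` anywhere in the line, and streamstats doesn’t need to know the layout.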

I have not yet started implementing the time features, so any thoughts or suggestions would be deeply appreciated.

Django: exposing settings in templates, the easy way

Sunday, July 26th, 2009

NOTE: I have submitted a patch to have this functionality as part of Django core. I’m posting this code here in case the patch gets rejected, since similar motions have previously been rejected. The ticket is at http://code.djangoproject.com/ticket/3818, and I will update this post if the patch gets accepted.

I find it very handy to be able to expose parts of settings.py in templates. The Django docs show you how to do this using context processors and give the code for the MEDIA_URL context processor here. However, it seems silly to me to have to write custom processors all the time. Instead, I propose a single processor that exposes a list of settings (stored in settings.py itself) to the template layer.
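
A sketch of what such a processor might look like. The stand-in settings object and the TEMPLATE_VISIBLE_SETTINGS name are assumptions for illustration; the gist below has the real code:

```python
class _Settings:
    """Stand-in for django.conf.settings, so the sketch is self-contained."""
    DEBUG = True
    SITE_NAME = 'example'
    TEMPLATE_VISIBLE_SETTINGS = ('SITE_NAME',)

settings = _Settings()

def exposed_settings(request):
    """Context processor: expose only a whitelisted list of settings
    (stored in settings.py itself) to the template layer."""
    names = getattr(settings, 'TEMPLATE_VISIBLE_SETTINGS', ())
    return {'settings': dict((name, getattr(settings, name)) for name in names)}
```

Note that DEBUG stays hidden because it isn’t whitelisted – the whole point is that one processor plus one settings tuple replaces a pile of custom processors.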

The gist containing the code is here: http://gist.github.com/155894

Enjoy.

Google Data API With OAuth Using the GData Python Client

Wednesday, April 22nd, 2009

On one of my projects (the one that’s kept me too busy to blog), I had to work quite a bit with the google data API. I was using Django, so I was using the gdata python client.

The client is, for the most part, excellent. It abstracts away the parts of OAuth that you never ever want to have to know anything about or debug. All the signing, encoding, and decoding takes place inside the library – where it should happen.

The spec is somewhat young, and the library is even more so; there isn’t a whole lot out there covering the subject. The client comes with some sample scripts, but they don’t make it very clear how to set things up in a web app workflow. I’m going to try to fill the void somewhat with a sample Django app and a somewhat extensive writeup. You don’t have to know/understand Django to follow along, but it may help.

Disclaimer: The sample app is written in the simplest possible way, so as to focus on the code that actually has to do with OAuth and the GData API. When I say “sample,” I really fucking mean it. I pay no attention to things like security and good practices, because I don’t give a shit about those things, and neither should you. The purpose is to understand how Google’s python client works. If you copy and paste this code into your production app and push your key, secret, and RSA creds to a public github repository, that is your own stupid fault.

OAuth Crash Course

There are many better resources for learning about OAuth proper, so I’ll just cover the basics here. OAuth is a protocol for securing communication between an API (the provider) and an app that wants to use said API (the consumer). It’s fairly complicated, but since Google has a pretty nice client wrapped around their API, most of that is abstracted away. What we need to know is the general OAuth workflow:

  1. User invokes a part of the consumer app that requires access to the provider’s API
  2. If the consumer already has an OAuth Access Token for that user, skip down to step 6
  3. If not, the consumer fetches an OAuth token key and token secret from the provider. The token secret is stored. The consumer then directs the user to an Authorization URL provided by the provider, with the token key as a query parameter, along with a “callback” URL pointing back to the consumer
  4. The user authorizes the consumer to use the provider’s data; the provider redirects the user back to the callback URL with the “Authorized Request Token” ‘oauth_token’ query variable set to its token key. The consumer can then request an Access Token from the provider by sending this token key back with the token secret that was saved in the previous step
  5. The provider returns the Access Token. The consumer can now use this Access Token to access the user’s data through the provider’s API until the user elects to revoke the Access Token
  6. The consumer does magic tricks with the user’s data, including making it disappear

Simple, right? I’ve left out all the encryption details (those are still murky to me as well), but luckily we don’t have to worry about that too much. I’ll reference this list of steps as I go along – mostly to keep myself sane. Let’s call a spade a fucking spade – shit is confusing! More info on OAuth and Google’s implementation of it can be found here. If you want to be really hardcore, you can read the spec here.

Some key terms to remember along the way:

  • Provider – service with the user’s data (in this case, Google)
  • Consumer – the app that wants to access the user’s data (your app)
  • Request Token – refers to a token used by the consumer to request access from the provider/user. It can be either authorized or unauthorized. The token is initially given by the provider to the consumer in an unauthorized state. The consumer then has to direct the user to the provider’s authorization page with the unauthorized token. If the user accepts, the provider will send the user back to the consumer with an authorized request token.
  • Access Token – refers to a token used by the consumer to actually authenticate with the provider. The consumer has to exchange an authorized request token for an access token before it can access data. The access token is what gives the consumer long-term access to the user’s data.
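
The request-token-to-access-token dance can be modeled with a toy in-memory provider. This is purely illustrative – the class, method names, and token format are all made up for this sketch, and a real provider also signs and verifies every request:

```python
import hashlib
import os

class ToyProvider:
    """Toy model of the OAuth 1.0 token dance described above."""

    def __init__(self):
        self._request = {}   # token key -> {'secret': ..., 'authorized': bool}
        self._access = set()

    def issue_request_token(self):
        # step 3: consumer fetches an unauthorized request token
        key, secret = os.urandom(8).hex(), os.urandom(8).hex()
        self._request[key] = {'secret': secret, 'authorized': False}
        return key, secret

    def authorize(self, key):
        # step 4: the user clicks "Allow" on the provider's page
        self._request[key]['authorized'] = True

    def upgrade(self, key, secret):
        # step 5: exchange the authorized request token for an access token
        token = self._request.pop(key)
        if not (token['authorized'] and token['secret'] == secret):
            raise ValueError('request token not authorized')
        access_key = hashlib.sha1((key + secret).encode()).hexdigest()
        self._access.add(access_key)
        return access_key
```

The thing to internalize: the request token is single-use and dies in the exchange; only the access token survives for long-term API calls.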

Getting Started

First and foremost, go here and sign up for an API key. Pretty straightforward. Store your app key and secret somewhere – I’m using the settings.py file in my sample app, though that may not be optimal security-wise. You’ll also see an rsa_key in my GDATA_CREDS dictionary. That’s there because my example uses RSA_SHA1 encryption. You don’t have to use that, but I recommend it, and the example will be easier to follow. I’ve seen in some documentation that may or may not be outdated that the contacts API only accepts RSA encryption, though I haven’t tried HMAC myself. To set up your RSA encryption, follow the steps here. You should be able to accomplish this on most unix/linux boxes.

Now that you have your app key and secret, and you’ve set up RSA encryption, you should download and install the gdata lib itself. Get it here.

Storing the Token

Before you start using tokens, you need some way to store them. A token consists of a token key and a token secret. You could create a database table to store those things individually. Me, I’m lazy – I just pickle my entire gdata token object. It lets me rebuild a functioning token object on the spot when I fetch it from the database. So for my case, I just have a text field for storing the token itself. See models.py in the sample app.

You could probably just as easily define a model with token and token secret fields and a method to return an initialized token object. Whatever your preference, define an appropriate model. You also need a way to tie the tokens to a user. My sample app does this using a foreign key to the django.contrib.auth.models.User model.
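
For the curious, the pickle-into-a-text-column trick looks roughly like this. A sketch, not the actual models.py – and the base64 step is my addition here, to keep raw pickle bytes safe inside a text field:

```python
import base64
import pickle

def token_to_text(token):
    """Pickle any token object into a text-safe string for a TextField."""
    return base64.b64encode(pickle.dumps(token)).decode('ascii')

def token_from_text(blob):
    """Rebuild a functioning token object from the stored string."""
    return pickle.loads(base64.b64decode(blob))
```

The upside is laziness: whatever attributes the gdata token grows, the round-trip keeps them without a schema change. The usual pickle caveat applies – only unpickle data you stored yourself.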

Creating a Token

Now we’re ready to actually start down the path towards getting the OAuth Access Token – the jackpot of API access. Seriously, you will feel a sense of palpable relief when you finally get one working. Like when passing a pineapple.

If you’re following along in the app, we’re going to skip over shit like user signup and auth (it’s only there so that people could set this up on their server and really see it work) and go straight for the money: getting the OAuth token. Look at /oauth/views.py->add_token. Approximate line numbers will be given in square brackets.

The first line of the view [~97] fetches the right scope for dealing with the contacts API. Scopes determine what parts of the user’s data your token will have access to; they are not part of the OAuth spec. They are stored under some pretty cryptic keys, so you may just have to look in the gdata source under gdata/service.py to find the right ones. I just want to use the contacts feed for the example, so I put ‘cp’ (what else would it be, right?).

Next, we initialize the gd_client variable, which is the object that will take care of all of our OAuth needs and communicate with TEH GOOGALZ [~99]. You will see that I tell it to use RSA encryption and pass it my app’s API credentials from settings.py. There are a few different subclasses to the base client that provide more specific methods for dealing with individual feeds (contacts, blogger, docs, etc). We’ll just use the base client for initializing our token.

Now we arrive at a conditional to check what part of the process we’re on [~107]. If we don’t have an ‘oauth_token’ GET parameter, that means that we’re at step 3 in our process – we have to send the user to Google to hit the big Authorize button.

The first thing we do is initialize a request token (rt). We then store the ‘secret’ part of that token. I store it in a session for simplicity/readability, but it can be stored anywhere.

We then tell our client to use this request token and to generate an authorization URL for that token [~116]. That’s the URL we will send our user to in order for them to authorize us. Note that in the callback we just pass the current URL. When the user clicks ‘Allow’ on Google’s authorization page, they will be sent right back to this same view, but this time with the ‘oauth_token’ parameter, meaning they will hit the ‘else’ part of our conditional.

When we do get back to the page, we’ll need to reconstruct our token [~127]. Google has modified our token and marked it authorized, but we still need to put back our piece of the puzzle – the token secret. Since we stored it in a session variable, we retrieve it with ease. Since our token now has no context, we also have to tell it what scopes we’re interested in.

So what we have now is an authorized request token. We’re still not out of the woods. Now we need to upgrade our token to an actual access token. This is done via the UpgradeToOAuthAccessToken method of the gdata client [~135].

As the comments in my code indicate, this part is a tad shifty [~141], at least as of version 1.2.4 of the python client. Since upgrading the token modifies it once more, we need to store the end result of the upgrade. However, the upgrade method does not return that modified token, it just stores it in the client’s token store, hence the find_token call. See discussion here for more info. UPDATE 05/05/2009: the patch to fix this awkwardness just got committed to SVN, so hopefully the next version of the library will no longer require the extra step.

Regardless, once we’ve retrieved the authorized token (at), we can store it [~145]. As I had mentioned, I just pickle it into a TextField.

That’s it. Once you’ve saved that token, you can reconstruct it at any time to fetch any data that token is authorized for (as in, which scopes you specified when sending the token to Google).

To see it in action, just take a look at the views.py->home. First, we find the latest OAuth token in our database [~66]. If we don’t have one, we redirect the user to the add_token view to go through the OAuth process [~86]. If we do have one, we fetch some data.

We create and initialize a ContactsService client [~71-78] and then pass it our token [~79-80]. Then BAM, we fetch some contacts from our user’s contacts feed.

That’s all for now. Hopefully this overview is helpful in getting everybody and their mother developing apps using the Google APIs. I welcome corrections and additions, and will update this post if I get any important ones.

APIMuni by Danny Roa – Bringing NextMuni To The Masses

Saturday, February 28th, 2009

Danny Roa, whom I met at the last Django Meetup, has put out a quick API for accessing Nextbus data.

It’s hosted on the App Engine and can be found here.

His writeup is here.

He recycled the scraping code from yourmuni, props to him for giving props :) Of course, that just means that when Nextbus gets angry, they’re going to come after me first!

Developers don’t create APIs for nothing, so I am eagerly anticipating what Danny is going to use this API for.

Reusable Logging in Django Apps

Saturday, February 28th, 2009

I have 3 drafts sitting in my queue – 1 really long post and 2 short ones. I’ve been picking away at the long post on the shuttle rides, but in the meantime I’m gonna try to push out the two quick ones. This is one of the quick ones.

I was trying to figure out how to set up reusable logging in my apps and have it fairly decoupled from the overall project. Here’s what I came up with:

  1. Set up a logger object using these instructions in settings.py and store it in the LOGGER variable.
  2. Grab it inside apps using django.conf.settings like so:

from django.conf import settings
try:
    logging = settings.LOGGER
except AttributeError:
    import logging
Then just use logging.debug, logging.info, etc. Thus, if a LOGGER is configured inside the project’s settings.py, we use that (django.conf.settings points to the settings.py of whatever project you’re working inside of, so you can move your app from project to project, no problem). Otherwise, we fall back to the vanilla logging functions with the global logging configuration. Nice and sweet.

Suggestions on other ways to do this are, as always, welcome.

Example on django snippets: here.

Django Tip: Using Dictionaries For Model Method Parameters

Tuesday, February 3rd, 2009

I’ve been working a whole lot outside of my job, mostly writing Python and working with Django. I don’t have much energy for a real blog post about something awesome, but I do have a tip to share. Advanced “pythonistas” won’t be impressed, but I haven’t seen this documented prominently anywhere, so I’ll toss it up anyway.

As we all know, Python supports keyword arguments, and the Django ORM takes full advantage of this. When doing lookups, the ORM parses keyword parameters in order to determine what SQL query to execute. A typical ORM call will look like this:

all_oatmeal = Cookie.objects.filter(cookie_type='oatmeal')
That’s very cool and expressive. However, what if our search criteria depend on user input? For example, what if we have a search form with multiple fields, but only want to search by the fields the user actually entered something into?

We could have a series of convoluted if/else statements to determine which variables were set and have a corresponding .filter() call for each possibility, but that would be dumb, convoluted, and hard to read later. Also, dumb.

Instead, we can use an alternative way of passing keyword arguments provided by Python (details here): putting ** in front of a dictionary being passed to a function makes Python unpack the dictionary and pass the pairs as keyword arguments to the function. Using that technique, we can arbitrarily construct a dictionary of the search parameters, then pass it to a single .filter() call at the bottom.

An over-simplified example, in which I assume that our form fields match up exactly with model properties:

for key in form_data:
    if form_data[key]=='':
        del form_data[key]
wanted_cookies = Cookie.objects.filter(**form_data)
I’m sure there’s a more elegant way to do the empty value stripping too, but that’s not our focus (comments on the subject are welcome, though, to make me smarter). The point is this: this technique allows for very clean, easy-to-read, efficient code.
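
One tidier way to do the empty-value stripping – a suggestion from your editor, not from the snippet above – is to build a fresh dict instead of deleting keys while iterating:

```python
def strip_empty(form_data):
    # keep only the fields the user actually filled in
    return dict((k, v) for k, v in form_data.items() if v != '')
```

Then the lookup becomes a one-liner: `Cookie.objects.filter(**strip_empty(form_data))`. Deleting from a dict while looping over it is also a bug waiting to happen in some Python versions, so the copy is safer as well as shorter.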

Creating model instances

cookie_data = {
    'cookie_type': 'oatmeal',
    'cookie_size': '3in',
    'cookie_touched': True
}
c = Cookie(**cookie_data)
c.save()
Obviously, this applies to more than just Django – there are many many use cases where this trick can come in handy. Enjoy!

Ternary in Python: The Cover Beats the Original

Saturday, January 24th, 2009

Like most newcomers to Python, I lamented the absence of a ternary operator. Some say the operator is hard to read, but I say those people need better reading glasses. If I want to assign a value based on a condition, I don’t think there’s anything clearer than an operator triad that exists solely for that purpose.

In any case, Python doesn’t have the standard = ? : ternary. It has a similar shorthand if-else construct, which I used begrudgingly. It looks like this:

result = value1 if condition else value2
THAT is hard to read. Code highlighting helps, but it’s still far from optimal.

This Kung Fu Is Weak!

Then I stumbled upon this random page while looking up exactly what the syntax was. The trick is to build a tuple on the fly and immediately select one of its elements using the condition as the index. The above code ends up looking like this:

result = (value2, value1)[condition]
How fucking awesomely elegant is that? As the post title suggests, I actually like this better than the original ternary.

As if to complement this trick, the bool type in Python actually evaluates to a numeric 0 or 1. This makes the classic “Set the value to itself if it’s already set; set it to X otherwise” case incredibly easy:

value = (X, value)[bool(value)]
A frequent use case is one where you have a function argument that you want to default to a value computed at call time, often from another argument. It can’t just be set in the function signature; you end up setting it to None in the signature (which, IMO, you should have done to begin with), then assigning it whatever the default value is if the caller doesn’t pass anything.
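
A quick sketch of that None-sentinel pattern (the tail function and its default are made up for illustration):

```python
def tail(items, n=None):
    """Return the last n items; the default for n depends on another
    argument, so it can't live in the function signature."""
    if n is None:
        n = len(items) // 2
    return items[-n:]
```

The signature advertises “n is optional” while the body computes the real default once the other arguments are known.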

Here’s a real example: an implementation of binary search in Python. (I’ve been going through Introduction to Algorithms as part of my New Year’s resolution to become a better programmer; I hate writing pseudocode, so I’ve been writing Python for all the exercises.)

UPDATE

Found an even better alternative for this case:

value = value or X
Example updated:
import math

def binSearch(needle, haystack, start=None, end=None):
    start = start or 0
    # careful: `end or ...` would clobber a legitimate end=0 (0 is falsy),
    # so this one needs an explicit None check
    end = len(haystack) - 1 if end is None else end

    if start > end:
        return -1  # not found
    midpt = start + int(math.floor((end - start) / 2))
    median = haystack[midpt]
    if needle > median:
        # exclude midpt from the next search, or the recursion can loop forever
        return binSearch(needle, haystack, midpt + 1, end)
    elif needle < median:
        return binSearch(needle, haystack, start, midpt - 1)
    else:
        return midpt

Enjoy.