Posts Tagged ‘statistics’

Streamstats status update

Monday, October 19th, 2009

Just wanted to post a quick note about the state of streamstats, the little tool I’ve been working on for analyzing logs/data files. Things stalled a bit when I started trying to implement time awareness, as it turned out that Python’s time parsing capabilities are limited, to put it nicely. I even tried to use regex to find a matching pattern before parsing the date, but I still could not parse the common date formats found in the logs this tool is intended to handle (namely, Apache logs; and no, changing the date format for all of Flickr’s hosts is not a fucking option, ok?). This was unacceptable.

I quickly recreated the basic functionality in PHP, using the famed strtotime function. Then I looked at the getopt() implementation available stock with PHP and realized I was either going to have to package a third party option (the pickings there were also slim), write my own lib to do it, or write a whole shitload of custom code specifically for streamstats. The first option was not attractive because I’d have to create a package for it for Yahoo!’s packaging system, and the other two are unattractive because… well, I’m trying to write a fucking stats analysis function, not options handling code.

That means streamstats is being rewritten for the 3rd time, in Perl. I’ll be using Getopt::Long and Date::Manip to keep the auxiliary logic out of the script. Luckily, the basic functionality won’t take long to recreate, and the features I’ve been trying to add shouldn’t be too bad either. Plus, I get to finally re-learn Perl.

Things to look forward to:

  • Time awareness (i.e. calculating how many times per second a certain value occurs, a distribution of frequencies, limiting the timeframe of interest)
  • proper histograms, with buckets etc
  • custom patterns (i.e., being able to specify which part of the incoming string is the time and which part is the value, for those that don’t want to use grep/awk to narrow it down beforehand)
  • multiple column comparison and relevant stats (gonna have to bust out the textbook on this one)

The next couple of months promise to be exciting in terms of shipping things, both at Flickr and outside. I’m looking forward to posting actual code for streamstats.pl.

streamstats: easier log analysis

Saturday, August 22nd, 2009

(Since I’m tired of WordPress’s WYSIWYG editor, I’m writing my posts in mostly markdown going forward. Eventually I’ll probably switch this blog over to another platform that supports markdown editing with less pain – likely Richard Crowley’s bashpress.)

Background

After watching our Ops team try to figure out if a certain IP was malicious by piping access log output into all sorts of utilities, I decided to whip up something that would simplify that task. The first pass resulted in a 40 line script that took me a bit longer than it should have – I had forgotten statistics completely and, to my embarrassment, didn’t remember the printf syntax well enough to format a histogram.

The code can be found on github.

Basic Functionality

Regardless, the initial version of the script is quite simple: it takes stdin and generates a histogram, as well as some statistics for values in the stream. Effectively, it assumes that by the time you’re piping things into it, you’ve narrowed it down to just the piece of information you’re interested in (an IP address, an API key, etc).

Output might look like this (fake info):

awk '{print $1}' access.log | streamstats.py

192.168.10.100 |                                        1
192.168.10.101 |                                        1
192.168.10.102 ||||                                     4
192.168.10.103 ||||                                     4
192.168.10.104 |                                        1
192.168.10.105 |                                        1
192.168.10.106 |||||                                    5
192.168.10.107 ||||                                     4
192.168.10.108 |||||||||||                              11 *
192.168.10.109 ||||||                                   6
192.168.10.110 |||                                      3
192.168.10.111 |                                        1

Some Statsy Things
count 12
outliers 1
maximum 11
minimum 1
stdev 2.84312035154
total 42
mean 3.5

As of this writing, I’ve added the -o option, which will limit the histogram output to show only outliers.
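For the curious, here is a minimal Python sketch of the basic idea. It is not the actual streamstats.py code: the function names summarize and render are made up for illustration, and the outlier rule (a count more than two standard deviations from the mean) is an assumption. That rule, together with the population standard deviation, happens to agree with the sample output above.

```python
from collections import Counter
from statistics import mean, pstdev


def summarize(values):
    """Count how often each value occurs and compute summary statistics."""
    counts = Counter(values)
    freqs = list(counts.values())
    avg = mean(freqs)
    # Population standard deviation; this matches the stdev in the sample output.
    sd = pstdev(freqs)
    # Assumption: an outlier is a count more than two standard deviations
    # away from the mean. The real script may use a different rule.
    outliers = {v for v, n in counts.items() if abs(n - avg) > 2 * sd}
    stats = {
        "count": len(freqs),
        "outliers": len(outliers),
        "maximum": max(freqs),
        "minimum": min(freqs),
        "stdev": sd,
        "total": sum(freqs),
        "mean": avg,
    }
    return counts, outliers, stats


def render(counts, outliers, only_outliers=False, width=40):
    """Return histogram lines: value, a bar of pipes, the count, '*' for outliers."""
    field = width + 1  # bar column is padded to a fixed width
    lines = []
    for value in sorted(counts):
        if only_outliers and value not in outliers:
            continue
        n = counts[value]
        bar = "|" * min(n, width)  # cap the bar so long runs do not overflow
        marker = " *" if value in outliers else ""
        lines.append(f"{value} {bar:<{field}}{n}{marker}")
    return lines
```

Fed the sample data above, this flags only 192.168.10.108 as an outlier, and passing only_outliers=True mimics what -o does by limiting the histogram to that one line.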

Moar! Time Awareness

The natural next step is to make streamstats time-aware. The ability to pipe an entire log through it and see statistics like requests per second per IP would greatly increase its usefulness. As it is now, a user has to first track down when a potential attack was happening, isolate just those log lines, then pipe them through. Making streamstats time-aware would allow that user to just specify a start time and an end time as parameters.
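To make the goal concrete, here is a rough sketch of what "requests per second per IP within a time window" could look like. Nothing here exists in streamstats yet: the function name requests_per_second is hypothetical, the input is assumed to be already split into (timestamp, value) pairs, and the timestamp format is assumed to be the Apache access log one, parsed with Python 3's datetime.strptime (which understands the %z offset, and expects English month names under the default locale).

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout, e.g. 19/Oct/2009:13:55:36 -0700
APACHE_TIME = "%d/%b/%Y:%H:%M:%S %z"


def requests_per_second(records, start=None, end=None):
    """Return {value: occurrences per second} for records inside [start, end].

    records is an iterable of (timestamp_string, value) pairs. start and end
    are optional timezone-aware datetimes; when omitted, the earliest and
    latest timestamps seen are used as the window edges.
    """
    counts = Counter()
    seen = []
    for stamp, value in records:
        when = datetime.strptime(stamp, APACHE_TIME)
        if start is not None and when < start:
            continue
        if end is not None and when > end:
            continue
        counts[value] += 1
        seen.append(when)
    if not seen:
        return {}
    first = start if start is not None else min(seen)
    last = end if end is not None else max(seen)
    # Avoid dividing by zero when everything falls within the same second.
    seconds = max((last - first).total_seconds(), 1.0)
    return {value: n / seconds for value, n in counts.items()}
```

The start and end arguments stand in for the start time and end time parameters described above, so the user would no longer have to isolate the relevant log lines by hand.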

Implementation Possibilities

The question is, how to add the ability to recognize timestamps in logfiles without making things overly complicated?

Candidate #1: awk and friends

My initial thought was to allow the user to supply command sequences as arguments that would tell streamstats how to locate the timestamp and the actual value of interest. The reasoning was to take advantage of the abundance of *nix utilities already available and the likely expert knowledge of said utilities possessed by the average expected user of streamstats.

Unfortunately (or, perhaps, fortunately), this was a non-starter. Shell-level string interpolation makes this awkward to the point of uselessness, if not impossible.

Candidate #2: Regular Expressions

Another tool that I would assume most sysadmins are familiar with, if nothing else by way of grep, is regular expressions. Those are a lot easier to implement – the script can just take regex patterns as arguments. My initial thoughts were to introduce the following parameters:

  • -t --time to let streamstats know that we’re interested in time, with an optional argument that would be the regex pattern used to find the time
  • -p --pattern to specify the pattern used to actually parse the value out of every line of input

However, there are some problems with this. It leaves the door open for the user to specify a time pattern and not a value pattern, resulting in garbage. I could force the user to add a -p pattern if -t is specified, but it just seems unnatural.

Instead, I think it might be easier to take advantage of named groups and add the following arguments:

  • -t --time to let streamstats know that we’re interested in time
  • -p --pattern to specify the regular expression that will parse out both the time and the value. In the absence of -t, the default pattern would be ^(?P<value>.+)\n$ (the whole line; in fact, regular expressions will just be bypassed). In the presence of -t, the default pattern would be ^\[(?P<time>.+)\]\s*(?P<value>.+)\n$.
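A quick sketch of how the named-group approach might work in Python, using the two default patterns exactly as written above. The helper parse_line and its want_time argument are hypothetical names for illustration; none of this is implemented in streamstats.

```python
import re

# The two default patterns proposed above.
VALUE_ONLY = re.compile(r"^(?P<value>.+)\n$")
TIME_AND_VALUE = re.compile(r"^\[(?P<time>.+)\]\s*(?P<value>.+)\n$")


def parse_line(line, pattern=None, want_time=False):
    """Return (time, value) for a line, or None if the pattern does not match.

    pattern may be a regex string or compiled regex containing a 'value'
    group and, optionally, a 'time' group. time is None when the pattern
    has no 'time' group.
    """
    if pattern is None:
        pattern = TIME_AND_VALUE if want_time else VALUE_ONLY
    elif isinstance(pattern, str):
        pattern = re.compile(pattern)
    match = pattern.match(line)
    if match is None:
        return None
    groups = match.groupdict()
    return groups.get("time"), groups["value"]
```

With the defaults, parse_line("[19/Oct/2009:13:55:36 -0700] 192.168.10.100\n", want_time=True) yields the timestamp string and the IP. A custom pattern such as ^(?P<value>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] could pull both straight out of a full Apache access log line, which is the case where no awk preprocessing would be needed.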

The advantage of this approach is that the default pattern allows the user to avoid having to specify a pattern altogether – they can just pipe the input to streamstats using awk '{printf "[%s] %s\n", $1, $2}' or something similar. However, the possibility remains to simply come up with the regex that parses out what’s needed from each line of input, thus eliminating the need to use other tools to manipulate the stream prior to streamstats. A possible downstream feature would be letting users maintain a list of frequently used patterns (probably some place like ~/.streamstats), avoiding the time usually spent conjuring up the right sequence of awk and grep.
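As for the saved patterns idea, one way it could work is sketched below. The file format is entirely made up (one "name = regex" entry per line, with # for comments), as is the function name parse_saved_patterns; the only thing taken from the plan above is that the list would probably live somewhere like ~/.streamstats.

```python
import re


def parse_saved_patterns(text):
    """Parse 'name = regex' lines into a {name: compiled_pattern} dict.

    Hypothetical format: blank lines and lines starting with '#' are
    ignored, and everything after the first '=' is treated as the regex.
    """
    patterns = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name, sep, regex = line.partition("=")
        if not sep:
            continue  # not a 'name = regex' line; skip it
        patterns[name.strip()] = re.compile(regex.strip())
    return patterns
```

The text would come from reading the file, for instance via open(os.path.expanduser("~/.streamstats")).read(), and a pattern could then be picked by name instead of being retyped on every invocation.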

I have not yet started implementing the time features, so any thoughts or suggestions would be deeply appreciated.