
streamstats: easier log analysis

Saturday, August 22nd, 2009

(Since I’m tired of WordPress’s WYSIWYG editor, I’m writing my posts mostly in markdown going forward. Eventually I’ll probably switch this blog over to another platform that supports markdown editing with less pain, likely Richard Crowley’s bashpress.)

Background

After watching our Ops team try to figure out if a certain IP was malicious by piping access log output into all sorts of utilities, I decided to whip up something that would simplify that task. The first pass resulted in a 40-line script that took me a bit longer than it should have: I had forgotten statistics completely and, to my embarrassment, didn’t remember the printf syntax well enough to format a histogram.

The code can be found on GitHub.

Basic Functionality

Regardless, the initial version of the script is quite simple: it takes stdin and generates a histogram, as well as some statistics for values in the stream. Effectively, it assumes that by the time you’re piping things into it, you’ve narrowed the input down to just the piece of information you’re interested in (an IP address, an API key, etc.).

Output might look like this (fake info):

awk '{print $1}' access.log | streamstats.py

192.168.10.100 |                                        1
192.168.10.101 |                                        1
192.168.10.102 ||||                                     4
192.168.10.103 ||||                                     4
192.168.10.104 |                                        1
192.168.10.105 |                                        1
192.168.10.106 |||||                                    5
192.168.10.107 ||||                                     4
192.168.10.108 |||||||||||                              11 *
192.168.10.109 ||||||                                   6
192.168.10.110 |||                                      3
192.168.10.111 |                                        1

Some Statsy Things
count 12
outliers 1
maximum 11
minimum 1
stdev 2.84312035154
total 42
mean 3.5

As of this writing, I’ve added the -o option, which will limit the histogram output to show only outliers.
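To make the moving parts concrete, here is a minimal sketch of the same idea. It is not the actual streamstats source: the function names, the 40-character bar width, and the outlier rule (more than two population standard deviations from the mean) are all assumptions chosen because they are consistent with the sample output above.

```python
import math
from collections import Counter

BAR_WIDTH = 40  # assumed width of the bar column


def summarize(counts, threshold=2.0):
    """Summarize a mapping of value -> number of occurrences.

    Assumption: a value is an outlier when its count lies more than
    `threshold` population standard deviations from the mean.
    """
    freqs = list(counts.values())
    n = len(freqs)
    total = sum(freqs)
    mean = total / n
    stdev = math.sqrt(sum((f - mean) ** 2 for f in freqs) / n)
    outliers = {v for v, f in counts.items() if abs(f - mean) > threshold * stdev}
    return {
        "count": n,
        "outliers": outliers,
        "maximum": max(freqs),
        "minimum": min(freqs),
        "stdev": stdev,
        "total": total,
        "mean": mean,
    }


def render(counts, stats, outliers_only=False):
    """Return histogram rows; outliers are marked with a trailing '*'."""
    rows = []
    for value in sorted(counts):
        is_outlier = value in stats["outliers"]
        if outliers_only and not is_outlier:
            continue  # hypothetical equivalent of the -o option
        bar = "|" * min(counts[value], BAR_WIDTH)
        marker = " *" if is_outlier else ""
        rows.append(f"{value} {bar:<{BAR_WIDTH}} {counts[value]}{marker}")
    return rows


def main(stream):
    """Count non-empty lines from `stream` and print histogram plus stats."""
    counts = Counter(line.strip() for line in stream if line.strip())
    if not counts:
        return
    stats = summarize(counts)
    print("\n".join(render(counts, stats)))
    for key in ("count", "maximum", "minimum", "stdev", "total", "mean"):
        print(key, stats[key])
    print("outliers", len(stats["outliers"]))
```

Calling `main(sys.stdin)` (after `import sys`) at the bottom of the file would turn the sketch into a filter that can sit at the end of a pipeline like the awk example above.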

Moar! Time Awareness

The natural next step is to make streamstats time-aware. The ability to pipe an entire log through it and see statistics like requests per second per IP would greatly increase its usefulness. As it is now, a user has to first track down when a potential attack was happening, isolate just those log lines, then pipe them through. Making streamstats time-aware would allow that user to just specify a start time and an end time as parameters.
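As a rough illustration of what time-awareness could compute, the sketch below takes already-parsed (timestamp, value) pairs and returns requests per second per value inside a window. The function name and the half-open window convention are assumptions for illustration only; how the timestamps get parsed out of a log line is a separate question.

```python
from collections import Counter
from datetime import datetime


def rates_in_window(events, start, end):
    """Return {value: requests per second} for start <= timestamp < end.

    `events` is an iterable of (datetime, value) pairs. Events outside
    the window are ignored, so a whole log can be passed in unfiltered.
    """
    seconds = (end - start).total_seconds()
    if seconds <= 0:
        raise ValueError("end must be after start")
    counts = Counter(value for ts, value in events if start <= ts < end)
    return {value: n / seconds for value, n in counts.items()}
```

With something like this in place, the user would only need to supply the start and end times instead of isolating the relevant log lines by hand.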

Implementation Possibilities

The question is, how to add the ability to recognize timestamps in logfiles without making things overly complicated?

Candidate #1: awk and friends

My initial thought was to allow the user to supply command sequences as arguments that would tell streamstats how to locate the timestamp and the actual value of interest. The reasoning was to take advantage of the abundance of *nix utilities already available and the likely expert knowledge of said utilities possessed by the average expected user of streamstats.

Unfortunately (or, perhaps, fortunately), this was a non-starter. Shell-level string interpolation makes this awkward to the point of uselessness, if not impossible.

Candidate #2: Regular Expressions

Regular expressions are another tool that I would assume most sysadmins are familiar with, if only by way of grep. They are also a lot easier to implement: the script can just take regex patterns as arguments. My initial thought was to introduce the following parameters:

  • -t, --time to let streamstats know that we’re interested in time, with an optional argument that would be the regex pattern used to find the time
  • -p, --pattern to specify the pattern used to actually parse the value out of every line of input

However, there are some problems with this. It leaves the door open for the user to specify a time pattern and not a value pattern, resulting in garbage. I could force the user to add a -p pattern if -t is specified, but it just seems unnatural.

Instead, I think it might be easier to take advantage of named groups and add the following arguments:

  • -t, --time to let streamstats know that we’re interested in time
  • -p, --pattern to specify the regular expression that will parse out both the time and the value. In the absence of -t, the default pattern would be ^(?P<value>.+)\n$ (the whole line; in fact, regular expressions will just be bypassed). In the presence of -t, the default pattern would be ^\[(?P<time>.+)\]\s*(?P<value>.+)\n$.
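A sketch of how the named-group approach might be wired up with argparse follows. Since the time features are not implemented yet, everything here is hypothetical: the helper names, the help strings, and the decision to reject a custom pattern that lacks a required group. Only the two default patterns are taken from the list above.

```python
import argparse
import re

# Default patterns as described above; the trailing \n assumes each
# line is matched with its newline still attached.
VALUE_ONLY = r"^(?P<value>.+)\n$"
TIME_AND_VALUE = r"^\[(?P<time>.+)\]\s*(?P<value>.+)\n$"


def build_parser():
    parser = argparse.ArgumentParser(prog="streamstats")
    parser.add_argument("-t", "--time", action="store_true",
                        help="treat the input as time-stamped")
    parser.add_argument("-p", "--pattern",
                        help="regex with a 'value' group (and 'time' with -t)")
    return parser


def compile_pattern(args):
    """Pick the default or user pattern and check its named groups."""
    default = TIME_AND_VALUE if args.time else VALUE_ONLY
    pattern = re.compile(args.pattern or default)
    required = {"time", "value"} if args.time else {"value"}
    missing = required - set(pattern.groupindex)
    if missing:
        raise ValueError("pattern lacks named group(s): " + ", ".join(sorted(missing)))
    return pattern


def parse_line(pattern, line):
    """Return (time, value) for a matching line, or None if it does not match.

    `time` is None when the pattern has no 'time' group.
    """
    match = pattern.match(line)
    if match is None:
        return None
    groups = match.groupdict()
    return groups.get("time"), groups["value"]
```

Because a single pattern carries both groups, the "time pattern without a value pattern" problem turns into a simple check on the compiled pattern's group names rather than a rule about which flags must appear together.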

The advantage of this approach is that the default pattern allows the user to avoid having to specify a pattern altogether – they can just pipe the input to streamstats using awk '{printf "[%s] %s\n", $1, $2}' or something similar. However, the possibility remains to simply come up with the regex that parses out what’s needed from each line of input, thus eliminating the need to use other tools to manipulate the stream prior to streamstats. A possible downstream feature would be letting users maintain a list of frequently used patterns (probably some place like ~/.streamstats), avoiding the time usually spent conjuring up the right sequence of awk and grep.

I have not yet started implementing the time features, so any thoughts or suggestions would be deeply appreciated.