streamstats: easier log analysis

(Since I’m tired of WordPress’s WYSIWYG editor, I’m writing my posts mostly in markdown going forward. Eventually I’ll probably switch this blog over to another platform that supports markdown editing with less pain – likely Richard Crowley’s bashpress.)

Background

After watching our Ops team try to figure out whether a certain IP was malicious by piping access log output into all sorts of utilities, I decided to whip up something that would simplify that task. The first pass resulted in a 40-line script that took me a bit longer than it should have – I had forgotten statistics completely and, to my embarrassment, didn’t remember the printf syntax well enough to format a histogram.

The code can be found on GitHub.

Basic Functionality

Regardless, the initial version of the script is quite simple: it reads stdin and generates a histogram, as well as some statistics, for the values in the stream. Effectively, it assumes that by the time you’re piping things into it, you’ve narrowed the stream down to just the piece of information you’re interested in (an IP address, an API key, etc.).

Output might look like this (fake info):

    awk '{print $1}' access.log | streamstats.py

    192.168.10.100 |                                        1
    192.168.10.101 |                                        1
    192.168.10.102 ||||                                     4
    192.168.10.103 ||||                                     4
    192.168.10.104 |                                        1
    192.168.10.105 |                                        1
    192.168.10.106 |||||                                    5
    192.168.10.107 ||||                                     4
    192.168.10.108 |||||||||||                              11 *
    192.168.10.109 ||||||                                   6
    192.168.10.110 |||                                      3
    192.168.10.111 |                                        1

    Some Statsy Things
    count     12
    outliers  1
    maximum   11
    minimum   1
    stdev     2.84312035154
    total     42
    mean      3.5

As of this writing, I’ve also added the -o option, which limits the histogram output to outliers only.
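
For a sense of the mechanics, here’s a rough sketch of the same idea in Python. To be clear, this is not the actual streamstats source (that’s what’s on GitHub), and the outlier rule used here – more than two standard deviations above the mean – is just an assumption for the sketch:

    #!/usr/bin/env python
    # Rough sketch of the streamstats idea -- not the actual source.
    # Reads one value per line from stdin, prints a histogram plus some
    # stats, and stars outliers (assumed rule: more than two standard
    # deviations above the mean). Pass -o to show outlier rows only.
    import math
    import sys
    from collections import Counter

    only_outliers = '-o' in sys.argv[1:]

    counts = Counter(line.strip() for line in sys.stdin if line.strip())
    if not counts:
        sys.exit(0)

    n = len(counts)
    total = sum(counts.values())
    mean = total / float(n)
    stdev = math.sqrt(sum((c - mean) ** 2 for c in counts.values()) / n)
    cutoff = mean + 2 * stdev

    for key, count in sorted(counts.items()):
        star = ' *' if count > cutoff else ''
        if only_outliers and not star:
            continue
        print('%-16s %-40s %d%s' % (key, '|' * count, count, star))

    print('\nSome Statsy Things')
    stats = [('count', n),
             ('outliers', sum(1 for c in counts.values() if c > cutoff)),
             ('maximum', max(counts.values())),
             ('minimum', min(counts.values())),
             ('stdev', stdev),
             ('total', total),
             ('mean', mean)]
    for name, value in stats:
        print('%-8s %s' % (name, value))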

Moar! Time Awareness

The natural next step is to make streamstats time-aware. The ability to pipe an entire log through it and see statistics like requests per second per IP would greatly increase its usefulness. As it is now, a user has to first track down when a potential attack was happening, isolate just those log lines, then pipe them through. Making streamstats time-aware would allow that user to just specify a start time and an end time as parameters.
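
To make that concrete, here’s a hypothetical sketch of what the time bucketing might look like. None of this is implemented, and the line format (a bracketed epoch timestamp followed by the value) is purely an assumption for illustration:

    # Hypothetical sketch only -- the time features are not implemented.
    # Assumes lines look like "[1287000000] 192.168.10.108" and that the
    # start/end window arrives as epoch seconds.
    from collections import defaultdict

    def per_second_counts(lines, start, end):
        """Bucket hits by (value, second) within the [start, end] window."""
        buckets = defaultdict(int)
        for line in lines:
            stamp, value = line.split(None, 1)
            second = int(stamp.strip('[]'))
            if start <= second <= end:
                buckets[(value.strip(), second)] += 1
        return buckets

    def peak_rates(buckets):
        """Reduce to the worst-case requests/sec seen for each value."""
        peaks = {}
        for (value, _second), count in buckets.items():
            peaks[value] = max(peaks.get(value, 0), count)
        return peaks

The resulting per-second peaks (or means) could then be fed through the same histogram and stats code as before.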

Implementation Possibilities

The question is, how to add the ability to recognize timestamps in logfiles without making things overly complicated?

Candidate #1: awk and friends

My initial thought was to allow the user to supply command sequences as arguments that would tell streamstats how to locate the timestamp and the actual value of interest. The reasoning was to take advantage of the abundance of *nix utilities already available, and of the expert knowledge of those utilities that the average streamstats user is likely to possess.

Unfortunately (or, perhaps, fortunately), this was a non-starter. Shell-level string interpolation makes this awkward to the point of uselessness, if not impossible.

Candidate #2: Regular Expressions

Another tool that I would assume most sysadmins are familiar with, if only by way of grep, is regular expressions. Those are a lot easier to implement – the script can just take regex patterns as arguments. My initial thoughts were to introduce the following parameters:

  • -t/--time to let streamstats know that we’re interested in time, with an optional argument that would be the regex pattern used to find the time
  • -p/--pattern to specify the pattern used to actually parse the value out of every line of input

However, there are some problems with this. It leaves the door open for the user to specify a time pattern but not a value pattern, resulting in garbage. I could force the user to add a -p pattern whenever -t is specified, but that just seems unnatural.

Instead, I think it might be easier to take advantage of named groups and add the following arguments:

  • -t/--time to let streamstats know that we’re interested in time
  • -p/--pattern to specify the regular expression that will parse out both the time and the value. In the absence of -t, the default pattern would be ^(?P<value>.+)\n$ (the whole line; in fact, the regular expression step would just be bypassed). In the presence of -t, the default pattern would be ^\[(?P<time>.+)\]\s*(?P<value>.+)\n$.

The advantage of this approach is that the default pattern allows the user to avoid having to specify a pattern altogether – they can just pipe the input to streamstats using awk '{printf "[%s] %s\n", $1, $2}' or something similar. However, the possibility remains to simply come up with the regex that parses out what’s needed from each line of input, thus eliminating the need to use other tools to manipulate the stream prior to streamstats. A possible downstream feature would be letting users maintain a list of frequently used patterns (probably some place like ~/.streamstats), avoiding the time usually spent conjuring up the right sequence of awk and grep.
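
Here’s a quick sketch of the named-group idea in action. This is assumed behavior – the time features aren’t written yet, and the timestamp in the sample line is made up:

    import re

    # The proposed -t default pattern: a bracketed timestamp, then the value.
    TIME_DEFAULT = re.compile(r'^\[(?P<time>.+)\]\s*(?P<value>.+)\n$')

    line = '[10/Oct/2010:13:55:36 -0700] 192.168.10.108\n'
    m = TIME_DEFAULT.match(line)
    if m:
        print(m.group('time'))   # 10/Oct/2010:13:55:36 -0700
        print(m.group('value'))  # 192.168.10.108

A user-supplied -p pattern would work exactly the same way, as long as it defines both the time and value groups.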

I have not yet started implementing the time features, so any thoughts or suggestions would be deeply appreciated.


2 Responses to “streamstats: easier log analysis”

  1. David Hall says:

Before you get too far, I would suggest doing some plots just to test what kind of distribution you have. You’re making an assumption that mean and standard deviation are meaningful values when they might not be. I see people do it all the time and draw pretty silly conclusions.

Plot yourself a histogram where you count how many people make 1 request, 2 requests, etc. (bins may need to be wider, depending on how much data you have, but I feel like you can get a lot). If you actually get a Gaussian, then you’re okay using those, but you might get something else, in which case, hit the statistics books some more to try to decide what you have. (One hint: if it’s bimodal, you probably have the sum of two distributions – say one is normal users and one is power users – in which case you may be able to split out the power users, treat them as a Gaussian, then look for outliers 2 or 3 standard deviations above that mean.)

  2. mihasya says:

Oops, never replied to this.

    The problem is that the data is arbitrary. While everything you suggest is absolutely correct, this ain’t the arena for it. This is meant simply as a quick and dirty tool to help parse logs on the spot to detect unusual activity, not long-term analysis. Most of the time, the red highlight is unnecessary anyway – you’ll be able to tell immediately which value occurs an unusual number of times, regardless of what the distribution is. I may do away with it entirely, since it is, as you pointed out, not entirely correct.
