Just wanted to post a quick note about the state of streamstats, the little tool I’ve been working on for analyzing logs and data files. Things stalled a bit when I started trying to implement time awareness, as it turned out that Python’s time-parsing capabilities are limited, to put it nicely. I even tried using a regex to find a matching pattern before parsing the date, but I still couldn’t parse the common date formats found in the logs this tool is intended for (namely, Apache logs; and no, changing the date format for all of Flickr’s hosts is not a fucking option, ok?). This was unacceptable.
I quickly recreated the basic functionality in PHP, using the famed strtotime function. Then I looked at the getopt() implementation that ships with PHP and realized I was either going to have to package a third-party option (the pickings there were also slim), write my own lib to do it, or write a whole shitload of custom code specifically for streamstats. The first option was not attractive because I’d have to create a package for it for Yahoo!’s packaging system, and the other two are unattractive because… well, I’m trying to write a fucking stats analysis function, not options handling code.
That means streamstats is being rewritten for the third time, in Perl. I’ll be using Getopt::Long and Date::Manip to keep the auxiliary logic out of the script. Luckily, the basic functionality won’t take long to recreate, and the features I’ve been trying to add shouldn’t be too bad either. Plus, I finally get to re-learn Perl.
Things to look forward to:
- Time awareness (i.e. calculating how many times per second a certain value occurs, a distribution of frequencies, limiting the timeframe of interest)
- Proper histograms, with buckets etc.
- Custom patterns (i.e., being able to specify which part of the incoming string is the time and which part is the value, for those who don’t want to use grep/awk to narrow it down beforehand)
- Multiple column comparison and relevant stats (gonna have to bust out the textbook on this one)
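To make the time-awareness and custom-pattern items a bit more concrete, here is a rough sketch of the idea. It is not streamstats code and not the planned Perl version; it is written in Python 3 purely for illustration (it relies on `strptime` accepting `%z`, which current Python 3 does). The names `PATTERN`, `TIME_FORMAT`, `per_second_counts`, and `frequency_distribution` are all made up for this example, and the sample lines are invented log entries in Apache's common log layout.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical "custom pattern": named groups say which part of the line is
# the time and which part is the value (here, the HTTP status code).
PATTERN = re.compile(r'\[(?P<time>[^\]]+)\] "[^"]*" (?P<value>\d{3})')
# Timestamp layout used by Apache's common log format.
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def per_second_counts(lines, value, start=None, end=None,
                      pattern=PATTERN, time_format=TIME_FORMAT):
    """Count how many times `value` occurs in each one-second slot.

    `start` and `end` are optional timezone-aware datetimes that limit the
    timeframe of interest (both bounds inclusive).
    """
    counts = Counter()
    for line in lines:
        match = pattern.search(line)
        if match is None or match.group("value") != value:
            continue  # line does not fit the pattern, or is a different value
        stamp = datetime.strptime(match.group("time"), time_format)
        if start is not None and stamp < start:
            continue
        if end is not None and stamp > end:
            continue
        counts[stamp] += 1
    return counts


def frequency_distribution(counts):
    """Map 'occurrences per second' to the number of seconds with that rate."""
    return Counter(counts.values())


# Invented sample input, for illustration only.
SAMPLE = [
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /a HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /b HTTP/1.0" 200 512',
    '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /c HTTP/1.0" 404 209',
    '127.0.0.1 - - [10/Oct/2000:13:55:37 -0700] "GET /a HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:55:38 -0700] "GET /a HTTP/1.0" 200 2326',
    '127.0.0.1 - - [10/Oct/2000:13:55:38 -0700] "GET /b HTTP/1.0" 200 512',
]

counts = per_second_counts(SAMPLE, "200")
for stamp in sorted(counts):
    print(stamp.isoformat(), counts[stamp])
print(dict(frequency_distribution(counts)))
```

One caveat with this sketch: seconds in which the value never occurs do not show up at all, so the distribution has no bucket for zero. A real implementation would probably want to fill in the empty seconds between the first and last timestamp.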
The next couple of months promise to be exciting in terms of shipping things, both at Flickr and outside. I’m looking forward to posting actual code for streamstats.pl.
Tags: perl, statistics, streamstats