<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: streamstats: easier log analysis</title>
	<atom:link href="http://mihasya.com/blog/streamstats-easier-log-analysis/feed/" rel="self" type="application/rss+xml" />
	<link>http://mihasya.com/blog/streamstats-easier-log-analysis/</link>
	<description>good things now come in packages of three</description>
	<lastBuildDate>Sat, 23 Jul 2011 22:40:16 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
	<item>
		<title>By: mihasya</title>
		<link>http://mihasya.com/blog/streamstats-easier-log-analysis/comment-page-1/#comment-3689</link>
		<dc:creator>mihasya</dc:creator>
		<pubDate>Sat, 02 Jan 2010 07:52:34 +0000</pubDate>
		<guid isPermaLink="false">http://mihasya.com/blog/?p=288#comment-3689</guid>
		<description>&lt;p&gt;Oops, I never replied to this.&lt;/p&gt;

&lt;p&gt;The problem is that the data is arbitrary. While everything you suggest is absolutely correct, this ain&#039;t the arena for it. This is meant simply as a quick and dirty tool to help parse logs on the spot to detect unusual activity, not long-term analysis. Most of the time, the red highlight is unnecessary anyway - you&#039;ll be able to tell immediately which value occurs an unusual number of times, regardless of what the distribution is. I may do away with it entirely, since it is, as you pointed out, not entirely correct.&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>Oops, I never replied to this.</p>

<p>The problem is that the data is arbitrary. While everything you suggest is absolutely correct, this ain&#8217;t the arena for it. This is meant simply as a quick and dirty tool to help parse logs on the spot to detect unusual activity, not long-term analysis. Most of the time, the red highlight is unnecessary anyway &#8211; you&#8217;ll be able to tell immediately which value occurs an unusual number of times, regardless of what the distribution is. I may do away with it entirely, since it is, as you pointed out, not entirely correct.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Hall</title>
		<link>http://mihasya.com/blog/streamstats-easier-log-analysis/comment-page-1/#comment-2909</link>
		<dc:creator>David Hall</dc:creator>
		<pubDate>Sun, 23 Aug 2009 14:55:33 +0000</pubDate>
		<guid isPermaLink="false">http://mihasya.com/blog/?p=288#comment-2909</guid>
		<description>&lt;p&gt;Before you get too far, I would suggest doing some plots just to test what kind of distribution you have. You&#039;re making an assumption that mean and standard deviation are meaningful values, when they might not be. I see people do it all the time and draw pretty silly conclusions.&lt;/p&gt;

&lt;p&gt;Plot yourself a histogram counting how many people make 1 request, 2 requests, etc. (bins may need to be wider, depending on how much data you have, but I feel like you can get a lot). If you actually get a Gaussian, then you&#039;re okay using mean and standard deviation, but you might get something else, in which case hit the statistics books some more to try to decide what you have. (One hint: if it&#039;s bimodal, you probably have a mixture of two distributions, say one for normal users and one for power users. In that case you may be able to split out the power users, treat them as a Gaussian, and then look for outliers 2 or 3 standard deviations above that mean.)&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>Before you get too far, I would suggest doing some plots just to test what kind of distribution you have. You&#8217;re making an assumption that mean and standard deviation are meaningful values, when they might not be. I see people do it all the time and draw pretty silly conclusions.</p>

<p>Plot yourself a histogram counting how many people make 1 request, 2 requests, etc. (bins may need to be wider, depending on how much data you have, but I feel like you can get a lot). If you actually get a Gaussian, then you&#8217;re okay using mean and standard deviation, but you might get something else, in which case hit the statistics books some more to try to decide what you have. (One hint: if it&#8217;s bimodal, you probably have a mixture of two distributions, say one for normal users and one for power users. In that case you may be able to split out the power users, treat them as a Gaussian, and then look for outliers 2 or 3 standard deviations above that mean.)</p>
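
<p>The histogram-and-threshold check described above could be sketched roughly like this in Python. This is only an illustration, not part of streamstats: it assumes the log has already been reduced to one user id per request, and the names <code>requests_histogram</code>, <code>outliers</code>, and the parameter <code>k</code> are made up here. The threshold step is only meaningful if the histogram looks roughly Gaussian.</p>

```python
from collections import Counter
from statistics import mean, pstdev


def requests_histogram(user_ids):
    """Map 'number of requests' to 'how many users made that many requests'."""
    per_user = Counter(user_ids)        # user id -> request count
    return Counter(per_user.values())   # request count -> number of users


def outliers(user_ids, k=2.0):
    """Return users whose request count exceeds mean + k * std dev.

    Assumption: request counts are roughly Gaussian; inspect the
    histogram first, since otherwise this threshold can mislead.
    """
    per_user = Counter(user_ids)
    counts = list(per_user.values())
    if not counts:
        return []
    threshold = mean(counts) + k * pstdev(counts)
    return sorted(user for user, count in per_user.items() if count > threshold)
```

<p>Calling <code>requests_histogram(ids)</code> on the list of per-request user ids gives the bins to plot, and <code>outliers(ids, k=3)</code> applies the stricter 3 standard deviation cut. With only a handful of users, a single heavy user inflates the standard deviation enough to hide itself, which is one more reason to look at the plot first.</p>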
]]></content:encoded>
	</item>
</channel>
</rss>

