For ShrimpTest to produce accurate statistics, it needs to be able to count in terms of a meaningful unit: the unique human visitor. This mythical notion is pretty hard to come by, but that doesn't mean we can't try to approximate it. To be clear, there are two complications packed into "unique human visitor": unique and human. Let's break it down:

A common approach to identifying unique visitors is cookieing. According to one study, first-party cookie acceptance (which is what ShrimpTest will use) sits at approximately 98%, while third-party cookie acceptance drops to around 90%. 98% is pretty solid. Even so, if we really want to include the remaining ~2% of real users in the stats, we can collapse them by a combination of IP address and user agent.
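To make that concrete, here is a minimal sketch of how a first-party cookie plus an IP/user-agent fallback could be combined into a single visitor key. The cookie name, function names, and 30-day lifetime are all placeholders of my own, not ShrimpTest's actual code.

```php
<?php
// Hypothetical sketch: derive a visitor key from a first-party cookie,
// falling back to a hash of IP address + user agent for the ~2% of real
// users who don't accept cookies. All names here are illustrative only.
define( 'SHRIMPTEST_COOKIE', 'shrimptest_visitor' ); // hypothetical cookie name

function shrimptest_get_visitor_key() {
	if ( isset( $_COOKIE[ SHRIMPTEST_COOKIE ] ) ) {
		return $_COOKIE[ SHRIMPTEST_COOKIE ];
	}

	// No cookie: collapse this visitor by IP + user agent instead.
	$ip = isset( $_SERVER['REMOTE_ADDR'] ) ? $_SERVER['REMOTE_ADDR'] : '';
	$ua = isset( $_SERVER['HTTP_USER_AGENT'] ) ? $_SERVER['HTTP_USER_AGENT'] : '';
	return 'ipua_' . md5( $ip . '|' . $ua );
}

function shrimptest_maybe_set_cookie() {
	if ( ! isset( $_COOKIE[ SHRIMPTEST_COOKIE ] ) ) {
		// First-party cookie, valid for 30 days (an arbitrary choice here).
		setcookie( SHRIMPTEST_COOKIE, md5( uniqid( '', true ) ), time() + 30 * 24 * 60 * 60, '/' );
	}
}
add_action( 'init', 'shrimptest_maybe_set_cookie' );
```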

Checking for human-ity is slightly trickier. Alas, captcha-ing every request to your site seems like a bad idea. One idea is to count only users with JavaScript as human. The consensus seems to be that approximately 95% of web users have JavaScript turned on, so that's decent, but the idea that "most bots don't have JavaScript", while sensible, doesn't seem to have any hard evidence behind it. For the time being, I've implemented a simple JS script which pings back on page load to report that JS is on (of course) and whether the cookie was picked up.
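As a rough sketch of what the server side of that ping-back could look like (the action name, cookie name, and storage step below are made up, not the plugin's real endpoint), the handler just needs to mark the current visitor as having JavaScript and note whether the cookie came back:

```php
<?php
// Hypothetical sketch of handling the JS ping-back: when the page-load
// ping arrives, record that this visitor has JavaScript and whether the
// cookie set on the earlier request was picked up.
function shrimptest_record_js_ping() {
	$has_cookie = isset( $_COOKIE['shrimptest_visitor'] ); // hypothetical cookie name

	// In a real implementation these flags would be persisted against the
	// visitor's record; here we only sketch what we'd want to store.
	$flags = array(
		'has_js'     => true,
		'has_cookie' => $has_cookie,
	);

	// ... persist $flags for this visitor ...

	wp_die(); // end the AJAX request cleanly
}
add_action( 'wp_ajax_nopriv_shrimptest_record_js_ping', 'shrimptest_record_js_ping' );
add_action( 'wp_ajax_shrimptest_record_js_ping', 'shrimptest_record_js_ping' );
```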

What we do have is user agent strings. While these can of course be spoofed, particularly by malicious bots, the majority of crawlers play by the rules. Today I created a blacklist and a black-terms list based on the lists at robotstxt.org and user-agent.org. Filtering out matching user agents, and not even assigning them cookies, has significantly cut down the rate at which new "visitors" are recorded. I've also committed a script, outside of the plugin's directory, to produce updated blocklists down the line.
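To illustrate the filtering step (the entries and function name below are placeholders, not the committed lists), the check is just an exact-match lookup plus a scan for banned substrings:

```php
<?php
// Hypothetical sketch of user-agent filtering: an exact-match blacklist
// plus a list of "black terms" (substrings). The real lists are generated
// from robotstxt.org and user-agent.org data; these entries are examples.
function shrimptest_is_blocked_user_agent( $user_agent ) {
	$blacklist   = array( 'Googlebot/2.1 (+http://www.google.com/bot.html)' ); // exact matches
	$black_terms = array( 'bot', 'crawler', 'spider', 'slurp' );               // substrings

	if ( in_array( $user_agent, $blacklist, true ) ) {
		return true;
	}

	foreach ( $black_terms as $term ) {
		if ( stripos( $user_agent, $term ) !== false ) {
			return true;
		}
	}

	return false;
}

// Visitors matching either list would never be assigned a ShrimpTest cookie:
// if ( shrimptest_is_blocked_user_agent( $_SERVER['HTTP_USER_AGENT'] ) ) return;
```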

There are a few other ideas out there, like creating "poison links" that human visitors never see and later throwing out all data from any IP and user agent that follows them. For the time being, I haven't implemented anything of the sort; I'm hoping that the user agent filtering, and perhaps JS filtering (if we choose to do that), will be good enough.
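For reference, a poison link amounts to something like the following. This is not implemented in ShrimpTest; the URL, CSS hiding, and hook choices are all invented for the sake of illustration.

```php
<?php
// Not implemented -- just illustrating the "poison link" idea: output a
// link humans never see, then flag any IP + user agent that requests it
// so its data can be thrown out later. The URL and markup are made up.
function shrimptest_poison_link() {
	echo '<a href="/?shrimptest_poison=1" style="display:none" rel="nofollow">do not follow</a>';
}
add_action( 'wp_footer', 'shrimptest_poison_link' );

function shrimptest_check_poison() {
	if ( isset( $_GET['shrimptest_poison'] ) ) {
		// Flag this IP + user agent combination for later exclusion.
		// ... record $_SERVER['REMOTE_ADDR'] and $_SERVER['HTTP_USER_AGENT'] ...
	}
}
add_action( 'init', 'shrimptest_check_poison' );
```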

I’ll share the results of my aforementioned A/A testing experiment (reset once today) on Monday.