Web analytics accuracy issues: why don’t the numbers match up?
At first glance, web analytics (the practice of finding out what visitors do on your site) should be easy. All you have to do is track each visitor, find out which pages they visit and how long they spend on them. However, even the total number of visitors per day or per week can be very hard to pin down with any accuracy.
Most people treat the figures that come from Google Analytics or another reporting suite as gospel, so it often comes as a shock when two sets of figures are compared. If the visitor numbers that come out of Google Analytics are plotted against the visitor numbers obtained by statistical analysis of the server logs, the GA figures will almost always be significantly lower.
It’s actually pretty hard to count visitors. First of all, you have to decide who should be counted: there is a world of difference between a real person looking at your site and an automated search crawler indexing the pages. Server logs record both search bots and human browsers. Tools like AWStats and Webalizer do try to filter out the crawlers, but it can be quite hard to catch them all, so visitor numbers taken from server logs might be a little inflated.
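To make the filtering problem concrete, here is a minimal Python sketch of user-agent filtering on server log lines. It assumes the logs are in Apache's Combined Log Format, and the marker list, function names and sample lines are all made up for illustration. It is not how AWStats or Webalizer actually work.

```python
import re

# Hypothetical, deliberately short list of crawler markers. Real tools keep
# much longer lists and still miss some bots.
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

# In Combined Log Format the user agent is the last quoted field on the line.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')


def is_probable_bot(log_line: str) -> bool:
    """Guess whether a log line came from a crawler, using its user agent."""
    match = USER_AGENT_RE.search(log_line)
    if match is None:
        return False  # no user agent found, so we cannot tell
    agent = match.group(1).lower()
    return any(marker in agent for marker in BOT_MARKERS)


def count_human_hits(log_lines) -> int:
    """Count the lines that were not flagged as crawler traffic."""
    return sum(1 for line in log_lines if not is_probable_bot(line))


# Made-up sample: one browser, one self-identified crawler, and one
# automated client whose user agent matches none of the markers.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2012:13:55:36 +0000] "GET / HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Windows NT 6.1) Firefox/15.0"',
    '198.51.100.7 - - [10/Oct/2012:13:55:40 +0000] "GET / HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.9 - - [10/Oct/2012:13:56:02 +0000] "GET /about HTTP/1.1" 200 1100 "-" "CustomFetcher/1.0"',
]

print(count_human_hits(SAMPLE_LOG))
```

The third sample line slips through because nothing in its user agent gives it away, so it is counted as a human hit. That is exactly how log-based figures end up a little high.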
Google Analytics uses another approach. Users loading a page trigger a piece of JavaScript. Most robot crawlers do not run this code snippet, so the visitors who register are almost all real people. The downside is that a significant percentage of users browse the internet with JavaScript disabled, so they aren’t counted either. Therefore Google Analytics tends to underestimate visitor numbers.
There is also the issue of what counts as a session. If a user closes their browser window and then immediately reopens it on one of your pages, does that count as a separate visit? If they are inactive for an hour, then resume browsing, should that count as the same session? Different analytics packages have different ways of counting sessions and this will also affect total visit counts.
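The effect of the session definition is easy to see in a small Python sketch. It assumes a session ends after a fixed period of inactivity, which is one common rule but not the only one. The function name, the timestamps and the timeout values are invented for the example.

```python
from datetime import datetime, timedelta


def count_sessions(timestamps, timeout: timedelta) -> int:
    """Count sessions for one visitor, starting a new session whenever the
    gap since the previous page view is longer than the timeout."""
    sessions = 0
    previous = None
    for ts in sorted(timestamps):
        if previous is None or ts - previous > timeout:
            sessions += 1
        previous = ts
    return sessions


# Made-up page views from a single visitor: gaps of 10, 65 and 5 minutes.
page_views = [
    datetime(2012, 10, 10, 9, 0),
    datetime(2012, 10, 10, 9, 10),
    datetime(2012, 10, 10, 10, 15),
    datetime(2012, 10, 10, 10, 20),
]

for minutes in (5, 30, 120):
    sessions = count_sessions(page_views, timedelta(minutes=minutes))
    print(f"{minutes}-minute timeout: {sessions} session(s)")
```

The same four page views count as three visits, two visits or one visit depending on the timeout, so two packages with different rules will disagree even when they see identical traffic.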
Then there are sampling issues. If your site sees millions of visits, some software packages will sample the dataset rather than examine each individual element. This can provide a good statistical picture of what’s going on, but it inevitably introduces some degree of error.
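A rough Python sketch shows where the sampling error comes from. It assumes simple random sampling, which may not be what any particular package does, and the dataset is synthetic: 100,000 visits of which exactly 20% carry some flag of interest (a bounce, say).

```python
import random


def estimate_share(population, sample_size, seed=0):
    """Estimate the share of flagged visits from a simple random sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sample = rng.sample(population, sample_size)
    return sum(sample) / sample_size


# Synthetic dataset: every fifth visit is flagged, so the true share is 20%.
visits = [i % 5 == 0 for i in range(100_000)]
true_share = sum(visits) / len(visits)

estimate = estimate_share(visits, 5_000, seed=42)
print(f"true share:      {true_share:.4f}")
print(f"sampled share:   {estimate:.4f}")
print(f"sampling error:  {estimate - true_share:+.4f}")
```

The sampled figure lands close to the true one but rarely matches it exactly, and a different seed gives a slightly different answer. Larger samples narrow the gap, and sampling the whole dataset removes it.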
So, with all those accuracy problems, how can web analysts hope to generate useful insights? Consistency is the key. If Google Analytics always underestimates visitor numbers by 12% because 12% of the target market disables JavaScript, the exact figures may be off but the trends are still sound. A 30% rise in visitor numbers is still a 30% rise in visitor numbers, even if the exact numbers aren’t known precisely.
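The arithmetic behind that claim can be checked in a few lines of Python. The visitor counts are invented, and the sketch assumes the undercount is a constant 12% in both periods, which is the whole point of the consistency argument.

```python
def percent_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


# Assumption: the tool consistently records 88% of real visitors.
CAPTURE_RATE = 0.88

# Invented true visitor counts for two periods (a 30% rise).
true_visits = (10_000, 13_000)
measured_visits = tuple(count * CAPTURE_RATE for count in true_visits)

true_rise = percent_change(*true_visits)
measured_rise = percent_change(*measured_visits)

print(f"true rise:     {true_rise:.1f}%")
print(f"measured rise: {measured_rise:.1f}%")
```

The constant factor cancels out of the ratio, so both rises come out the same. If the undercount drifted between the two periods, they would not.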
Usually, problems only arise when measurement methods change. If you’re moving from one analytics package to another, it’s good practice to run the two side by side for at least a month. That way you’ll be able to see how the two sets of figures relate to one another. Don’t worry if they don’t match up exactly, because they almost certainly won’t!
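One simple way to read a side-by-side run is to look at the daily ratio between the two packages. The Python sketch below does that with four invented days of data (a real comparison would use the full month), and the function name is hypothetical.

```python
from statistics import mean, pstdev


def compare_packages(old_counts, new_counts):
    """Return the average daily ratio of new to old counts, and how much
    that ratio varies from day to day (population standard deviation)."""
    ratios = [new / old for old, new in zip(old_counts, new_counts)]
    return mean(ratios), pstdev(ratios)


# Invented daily visitor counts from two packages running side by side.
old_package = [1000, 1200, 900, 1100]
new_package = [870, 1068, 783, 979]

average_ratio, spread = compare_packages(old_package, new_package)
print(f"new package reports {average_ratio:.0%} of the old figure on average")
print(f"day-to-day spread in that ratio: {spread:.3f}")
```

A steady ratio with a small spread means historical figures from the old package can be roughly rescaled to compare with the new one. A ratio that swings widely suggests the two are measuring something different.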