Saturday, 19 of May of 2012

Category » Technical notes

Web analytics accuracy issues: why don’t the numbers match up?

At first glance, web analytics (the practice of finding out about what visitors do on your site) should be easy. All you have to do is track each one, find out which pages they visit and how long they spend on them. However, even a total number of visitors per day or per week can be very hard to pin down with any accuracy.

Most people treat the figures that come from Google Analytics or another reporting suite as gospel, so it often comes as a shock when two sets of figures are compared. If the visitor numbers that come out of Google Analytics are plotted against the visitor numbers obtained by statistical analysis of the server logs, the GA figures will almost always be significantly lower.

It’s actually pretty hard to count visitors. First of all, you have to decide who should be counted: there is a world of difference between a real person looking at your site and an automated search crawler indexing the pages. Server logs count both search bots and human browsers, and while tools like AWStats and Webalizer do try to filter out the crawlers, it can be quite hard to catch them all, so visitor numbers taken from logs might be a little inflated.
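As a rough illustration of why crawler filtering is leaky, here is a minimal sketch that flags hits by looking for tell-tale words in the user-agent string. The pattern and the sample user agents are illustrative assumptions, not the rules that AWStats or Webalizer actually apply.

```python
import re

# Hypothetical, deliberately incomplete list of crawler markers.
BOT_PATTERN = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)


def is_probable_bot(user_agent: str) -> bool:
    """Return True if the user-agent string looks like a crawler."""
    return bool(BOT_PATTERN.search(user_agent))


def count_human_hits(user_agents) -> int:
    """Count hits whose user agent does not match the bot pattern."""
    return sum(1 for ua in user_agents if not is_probable_bot(ua))


# Made-up sample of user-agent strings, as they might appear in a server log.
sample_agents = [
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 Chrome/18.0",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (compatible; Yahoo! Slurp)",
    "SomeScraper/1.0",  # a crawler that does not announce itself
]

print(count_human_hits(sample_agents))  # 2: one real browser plus the unannounced scraper
```

The scraper in the last line is counted as a person, which is exactly the kind of inflation described above.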

Google Analytics uses another approach. Users loading a page trigger a piece of JavaScript. This code snippet is not tripped by robot crawlers, so the only visitors who register are real people. The downside is that a significant percentage of users browse the internet with JavaScript disabled, so they aren’t counted either. Therefore Google Analytics tends to underestimate visitor numbers.

There is also the issue of what counts as a session. If a user closes their browser window and then immediately reopens it on one of your pages, does that count as a separate visit? If they are inactive for an hour, then resume browsing, should that count as the same session? Different analytics packages have different ways of counting sessions and this will also affect total visit counts.
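To see how much the session definition matters, here is a minimal sketch that counts sessions for a single visitor using an inactivity timeout. The timeout values and the timestamps are assumptions chosen for illustration; real packages apply their own rules.

```python
from datetime import datetime, timedelta


def count_sessions(hit_times, timeout_minutes=30):
    """Count sessions for one visitor.

    A gap between consecutive hits that is longer than the timeout
    starts a new session.
    """
    timeout = timedelta(minutes=timeout_minutes)
    sessions = 0
    previous = None
    for t in sorted(hit_times):
        if previous is None or t - previous > timeout:
            sessions += 1
        previous = t
    return sessions


# Made-up page views from one visitor on one morning.
hits = [
    datetime(2012, 5, 19, 9, 0),
    datetime(2012, 5, 19, 9, 10),
    datetime(2012, 5, 19, 10, 15),  # 65 minutes after the previous hit
    datetime(2012, 5, 19, 10, 20),
]

print(count_sessions(hits, timeout_minutes=30))  # 2
print(count_sessions(hits, timeout_minutes=90))  # 1
```

The same four page views produce one visit or two depending on nothing more than the timeout chosen.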

Then there are sampling issues. If your site sees millions of visits, some software packages will sample the dataset rather than examine each individual element. This can provide a good statistical picture of what’s going on, but it inevitably introduces some degree of error.
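The size of that error can be estimated. The sketch below uses the standard textbook formula for the standard error of a proportion and assumes simple random sampling, which may not be how any particular package samples. The 2% conversion rate is a made-up figure.

```python
from math import sqrt


def proportion_standard_error(p: float, sample_size: int) -> float:
    """Standard error of a proportion estimated from a simple random sample."""
    return sqrt(p * (1 - p) / sample_size)


conversion_rate = 0.02  # hypothetical 2% conversion rate

for n in (1_000, 10_000, 100_000):
    se = proportion_standard_error(conversion_rate, n)
    # 1.96 standard errors gives an approximate 95% margin.
    print(f"sample of {n}: margin of about {1.96 * se * 100:.2f} percentage points")
```

Bigger samples shrink the margin, but it never quite reaches zero.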

So, with all those accuracy problems, how can web analysts hope to generate useful insights? Consistency is the key. If Google Analytics always underestimates visitor numbers by 12% because 12% of the target market disables JavaScript, the exact figures may be off but the trends are still sound. A 30% rise in visitor numbers is still a 30% rise in visitor numbers, even if the exact numbers aren’t known precisely.
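The arithmetic behind that claim is short enough to check directly. This sketch assumes a constant 12% undercount, as in the example above; the true visit counts are invented.

```python
def measured_visits(true_visits: float, undercount: float) -> float:
    """Visits that a tracker reports if it misses a fixed fraction of them."""
    return true_visits * (1 - undercount)


def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


true_before, true_after = 1000, 1300  # made-up true figures: a 30% rise

seen_before = measured_visits(true_before, 0.12)
seen_after = measured_visits(true_after, 0.12)

print(round(percent_change(true_before, true_after), 6))  # 30.0
print(round(percent_change(seen_before, seen_after), 6))  # 30.0
```

The constant factor cancels out of the ratio, which is why the trend survives even though both measured totals are too low.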

Usually, problems only arise when measurement methods change. If you’re moving from one analytics package to another, it’s good practice to run the two side by side for at least a month. That way you’ll be able to see how the two sets of figures relate to one another. Don’t worry if they don’t match up exactly, because they almost certainly won’t!


Quick Twitter monitoring for busy businesses

No business wants to waste time (and therefore money) on activities that don’t deliver a good return, and it’s true that some people still consider Twitter a social tool rather than a commercial one, but there are plenty of reasons to get involved with it. Twitter can be used to establish yourself as a knowledgeable expert in your field, or it can be used to find new clients directly.

In terms of the raw number of searches performed each month, Twitter sees more than Bing and Yahoo together. The figure is a little misleading (the percentage of automated searches is much higher on Twitter) but it does demonstrate the massive power and influence of social media. Twitter is now being used as a source of news as well as entertainment.

Social searching is also growing fast. 20 years ago, you might have picked up the telephone directory to find a local plumber. 10 years ago, searching through Google would have been just as likely. Now, rather than relying on a search engine’s rankings to find the best plumber in your area, you may well put out a call on Facebook or Tweet something like ‘Anyone know a good plumber in Philadelphia?’

People who use Twitter in this way are looking for trusted recommendations from real people, but smart businesses who monitor Twitter for terms like “plumber” + “Philadelphia” can also reply, and if they appear helpful they’ve got a good chance of picking up a new client.

Twitter monitoring isn’t difficult and it doesn’t have to take up time. Tools like Monitter (monitter.com) trawl through Twitter traffic in real time, collecting relevant posts within a set geographical area. Others, like Tweetbeep, can send email updates whenever your chosen set of search terms comes up. If you use Google Alerts to monitor new websites, Tweetbeep can extend your watch to include Twitter traffic. Twazzup searches through recent news items and Tweets for the terms of your choice. There are also numerous mobile apps for sending, reading, and monitoring tweets.

If you’re serious about Twitter monitoring, Monitter can be left running in the background as you work. Once set up, Tweetbeep won’t bother you unless something relevant comes up, and whenever you happen to have a spare five minutes, you can check Twazzup or a mobile app like Echofon. Twitter monitoring doesn’t have to take long and it doesn’t have to be complex. Anyone can do it, and enjoy the benefits.


Google Analytics update may have a greater impact than expected

Last week Google rolled out a minor but significant change in the way user sessions are calculated. They weren’t expecting many sites to see much of a change in their metrics, but a number of sources in the analytics community are reporting significant shifts, most likely caused by the algorithm change.

The changes only apply to visitors who leave a site, then re-enter by other means within the space of half an hour, or when a user closes their browser for a short time, then re-opens it. Under the old regime closure of the browser meant closure of a session, and that is no longer guaranteed. On the other hand, arriving on a site through organic search, bookmarking it, leaving and returning within 30 minutes would not have counted as two separate sessions before the changes came through, but it will now.

Sound complicated? It’s not the exact definition of a session that most users need to worry about, but the change in definition. Whenever a metric is altered, it can mean confusion. Some figures may go up and some may go down, even though there is no change in traffic pattern.

Google didn’t think that this batch of modifications would have much of an impact (and there were good reasons to tweak session definitions) but many reliable sources are seeing swings of 10% or more when comparing figures from before and after the algorithm change.

Conversion rates and average time on site measures are most likely to be affected, along with total visitor numbers and any metric calculated from them.

When tracking the changes in your Google Analytics metrics from July to August, be aware that you might see the effect of the tweak. If you want to check, take a close look at the data on either side of the 11th/12th of this month. Any sudden, statistically unusual jumps that can’t be otherwise attributed could be down to this GA update.
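One rough way to make that check is to compare the daily figures on either side of the change date. The sketch below is a simple heuristic rather than a formal statistical test, and the daily visit counts are invented for illustration.

```python
from statistics import mean, stdev


def shift_in_sd(before, after):
    """Distance between the two means, in units of the 'before' standard deviation."""
    return (mean(after) - mean(before)) / stdev(before)


def percent_shift(before, after):
    """Change in the mean from before to after, in percent."""
    return (mean(after) - mean(before)) / mean(before) * 100


# Made-up daily visit counts for the days before and after the update.
before = [1000, 1020, 980, 1010, 990]
after = [1100, 1120, 1090, 1110, 1080]

print(f"mean shifted by {percent_shift(before, after):.1f}%")
print(f"that is {shift_in_sd(before, after):.1f} standard deviations of normal day-to-day variation")
```

A shift of several standard deviations that lines up with the update date, and with nothing else you changed, is a reasonable candidate for a measurement effect rather than a real change in traffic.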


Banner ads by stealth

Google has found a new way to combat banner blindness: PPC advertising by stealth.

Banner advertising is a struggling beast, and as the massive pay per click industry depends on it, something had to change.

‘Banner blindness’ is the name given to the new-found ability of the brain to ignore any piece of a website that looks like an advertising banner. It’s not just an invention of SEO and web marketing types, but a well established and well understood phenomenon. Some internet users choose to block ads with software, others are so used to gaudy banners along the top of their browser screens that they simply don’t see them any more.

Making ads stand out from the background has been the favourite method of combating banner blindness for years. The idea has been to make ‘em bigger, make ‘em brighter, and in extreme cases make ‘em flash on and off or jump around the page. This strategy didn’t work, and annoying Flash-based PPC ads only really served to drive more people to install better ad blocking software.

Every impression that doesn’t result in a click is lost revenue, so obviously this banner blindness thing is a serious problem for AdWords and other PPC systems.

A couple of PPC providers got clever and started placing ads away from the usual locations, e.g. in sidebars, and this probably did work for a little while. Then, of course, users got used to an ad in the right hand sidebar and started ignoring it. The human brain is quick to adapt.

Google’s new search engine results page format takes a completely different approach to PPC ad placement. Smart cookies that they are, Google realised that making ads stand out is not the answer. The paid ads that appear on their results pages now look more like organic search results than ever before. They blend in almost seamlessly, and even the most jaded internet brain has to actually look at the ads to decide whether or not they are in fact ads at all.

Seeing and reading are the first steps along the way to clicking a PPC ad. Will this new strategy drive up click through rates? Who knows, but I expect we’ll see more stealthy advertising and fewer neon colours and flashy graphics in the near future.


How good is Google Analytics?

Google Analytics isn't perfect but it is very good.

Both very good and not very good at all. Google Analytics is very good at giving simple answers to complex questions, which is great news for anyone without the budget to pay for expensive web analytics software. The bad news is that the answers it gives are far from exact, and you are usually not warned when this is the case.

To start with, Google Analytics is based on the activation of snippets of JavaScript code. Not everyone browses with JavaScript enabled and anyone who doesn’t will slip through the net undetected. Estimates for the proportion of users who routinely use the internet without JavaScript are as high as 15%, sometimes a little more. Straight away that’s anything up to 15% of users who won’t be included in statistics generated by Google Analytics. Ouch!

It’s not all bad though. Assuming the proportion of users without JavaScript enabled stays more or less constant, you can still get a lot of value out of the traffic values and trends reported by Google Analytics. Serious problems with the trends have been reported by several analytics professionals, so they shouldn’t be taken as gospel, but in most cases solid information can be obtained.

But then there are the errors. Some fancy site architectures cause huge problems when you try to apply tag-based web analytics solutions. False visits can be generated and sources can get all mixed up. Most of these issues can be fixed with very careful application of the code snippets, but in order to fix something you must first be aware that there is a problem. Watch for a large volume of tiny duration visits or a higher than expected proportion of direct arrivals. These could be a sign that something’s wrong, particularly if you are trying to track Flash activity.
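A quick way to keep an eye on the first of those warning signs is to work out what share of visits are very short. This is only a sketch: the five-second threshold and the visit durations are assumptions, and what counts as suspicious will depend on your site.

```python
def share_of_short_visits(durations_seconds, threshold_seconds=5):
    """Fraction of visits shorter than the threshold (0.0 for an empty list)."""
    if not durations_seconds:
        return 0.0
    short = sum(1 for d in durations_seconds if d < threshold_seconds)
    return short / len(durations_seconds)


# Made-up visit durations in seconds, as exported from an analytics tool.
durations = [0, 1, 2, 45, 130, 0, 310, 3]

print(share_of_short_visits(durations))  # 0.625
```

If that share jumps after a change to the site or to the tracking tags, the tagging is worth a second look.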

Take care when applying segments other than the basic ‘All Visits’. Some calculations can’t be performed in other segments, be they your own custom ones or the standards that come ready-defined, and if you try to make these happen the active segment will revert to ‘All Visits’. Sharp eyes will note an error message but it’s easy to miss.

Of the multitude of calculations and investigations that can be made with Google Analytics, some work really well and some don’t. If you do a lot of comparison with past time periods you will probably see some inconsistencies appear pretty quickly, and that is just one example. The trick lies in knowing what you can rely on and what’s shaky, and what can be helped along with data from other sources.

But let’s take a step back here. We’re talking about a free web analytics tool, not something that costs tens of thousands of pounds a month (these services do exist, and people do pay for them). While Google Analytics does add value to AdWords and other revenue generating bits of the Google family, it is still free to all comers. Is it perfect? No, but web analytics is a very complicated discipline and it can’t be expected to be. Is it useful? Absolutely, if it’s used properly.

Google Analytics is a fabulous tool. It takes a level of detail that was previously only available through a few tricky and/or expensive means and brings it into almost anyone’s reach. But like any complex piece of machinery, it needs to be treated with care and understood intimately if you’re to get accurate and actionable results out of it.


Building an analytics culture in your company

“Building an analytics culture in your company”. It sounds like a buzzword (ok, buzzphrase) and to be honest it probably is, but there is something of value in there.

In order to get real value out of web analytics, you and your staff and contractors need to understand exactly what you are looking at, but most of the responsibility for conveying web analytics information smoothly and directly belongs to us, and not to our clients. To this end we use visualisations and graphs, tables, and the written word in various forms. There is a world of difference between data and information. We believe that anyone looking at one of our reports should understand what’s being conveyed without difficulty.

The culture of analytics doesn’t mean your staff need to understand how to get at web analytics data or do a statistics course. They certainly don’t need to feel like an external analytics agency is judging their every move. All they really need is to understand how the information can help them. We like to work with a wide range of staff handling different aspects of a client’s business, and it usually doesn’t take long before different departments start to ask their own questions and engage with the analytics process.


Measures and metrics

Acknowledging inaccuracy in your stats will actually do you a pretty big favour.

Traffic data is difficult to collect. Tag-based tracking systems like Google Analytics inevitably miss some visitors, and server log analysis isn’t perfectly accurate either. Let’s assume your tracking is JavaScript-based, and perfectly optimised. It will still miss the 10 to 15% of users that routinely browse the wonders of the internet without JavaScript enabled. However, that doesn’t mean you’ve got a total visit count with a 15% error on it.

What you have is a figure that is usually labelled ‘Total Visitors’, but is in fact not that at all. More correctly, it’s a lower bound on total visitors. It’s a measure of total visitor numbers, sure, but it is not the total number of visitors. When the measure goes up, you know visitors have gone up (assuming the fraction of JavaScript-less users remains the same, which is not unreasonable). When the measure goes down, it’s fair to say that traffic has dropped.

The lower bound on total visitors is a very useful thing to know, but it’s also useful to acknowledge that it is not a true and perfect total visitor count. For a start, presenting the real state of affairs to potential investors or advertisers lets you use a bigger best-estimate traffic figure than the one presented by your JavaScript-based tracking system. Your website looks more popular. In fact, it probably is more popular than you think if all you are relying on right now is JavaScript-based tracking like that used by Google Analytics.
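Turning the lower bound into a best estimate is a one-line calculation. The sketch below assumes tag-based tracking misses a fixed fraction of visitors (the 10 to 15% range quoted above) and nothing else; the measured figure of 8,500 is invented.

```python
def estimate_total(measured: float, missed_fraction: float) -> float:
    """Scale a measured count up to an estimated true total.

    missed_fraction is the assumed share of visitors the tracker never sees.
    """
    if not 0 <= missed_fraction < 1:
        raise ValueError("missed_fraction must be in the range [0, 1)")
    return measured / (1 - missed_fraction)


measured = 8500  # made-up 'Total Visitors' figure: the lower bound

low = estimate_total(measured, 0.10)
high = estimate_total(measured, 0.15)

print(f"lower bound: {measured}")
print(f"best estimate: between {low:.0f} and {high:.0f}")
```

Note that the correction divides by the fraction seen rather than adding 15% on top, so a 15% miss rate means the true figure is roughly 17.6% higher than the measured one.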

We believe you should always try to get an idea of how accurate all your figures are. Knowing that protects you from making poor decisions based on poor data and gives you the confidence to move forward from a fully justifiable position, but in cases like the one discussed above, acknowledging inaccuracy in your stats will actually do you a pretty big favour.


Time scales for SEO and web analytics

Whether your search rankings change for the better or for the worse, wait at least four or five days before celebrating or panicking.

Patience is a virtue, except on the internet. The sheer speed of online communication is astounding when you consider the distances involved, and we’ve all come to expect more or less instant access to data housed anywhere in the world.

However, not everything comes quickly. The benefits of SEO in particular come to those who wait. The time scale for a comprehensive search engine optimisation campaign is rarely less than a month or two for sites in a moderately competitive field. Sites that start out with very little content and few links may see gains fast if they choose keywords conservatively and keep their focus localised, but most companies looking to make serious money using the internet will have to wait a little longer to see strong progress.

That’s often said on various blogs and forums, but what is less commonly mentioned is that the initial ranking gains made during an SEO campaign may not last long. Don’t get too excited if someone gets you into the top spot for your chosen keyword, because that shift in rankings may or may not be robust. Algorithm changes (and Google makes at least one algorithm change per day on average) and action by competing sites can see your place in the results pages drop back down.

Some SEO actions produce more robust gains than others, and the stability of your step up in the rankings will also depend on how competitive your field and your keywords are. SEO is a long term process and it needs to be ongoing, at least to some degree, if you intend to keep any gains you make, but there will probably be short term fluctuations in your rankings from day to day as well.

Whether your rankings change for the better or for the worse, wait at least four or five days before celebrating or panicking.


The difference between data and information

Data is valuable. Nobody holds data in higher esteem than data miners and web analysts, but there is a world of difference between data and information.

Take the famous (to devotees of data mining) and much admired (by devotees of data mining) Walmart customer basket database. The implementation of barcode scanners meant that each collection of products bought by a Walmart customer could be recorded, and they did just that. Back in the early years of the new century this database was probably the biggest in the world. It ran into terabytes back when that was an impressive achievement (even to devotees of data mining). Depending on how you reckon 1 TB, it’s either 1000000000000 or 1099511627776 bytes. The good folks at the University of California, Berkeley, reckon that amount of data at 50000 trees or a good sized forest worth of paper. To put it another way, Dr Jess’ PhD thesis weighed in at 2464 KB, or one 405844th of a terabyte.
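For anyone who wants to check the sums, here is the arithmetic as a short sketch. It shows that the one-405844th figure corresponds to the decimal definitions (1 KB = 1000 bytes, 1 TB = 10^12 bytes); the binary definitions give a slightly different answer.

```python
TB_DECIMAL = 10**12  # 1000000000000 bytes
TB_BINARY = 2**40    # 1099511627776 bytes

THESIS_KB = 2464

# Thesis size in bytes under each definition of a kilobyte.
thesis_bytes_decimal = THESIS_KB * 1000
thesis_bytes_binary = THESIS_KB * 1024

# How many theses fit into one terabyte (whole copies only).
print(TB_DECIMAL // thesis_bytes_decimal)  # 405844
print(TB_BINARY // thesis_bytes_binary)    # 435771
```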

Cute arithmetic aside, a TB is a lot of data, and there are plenty of databases of that size around today. Many of the server logs we deal with are well into GB or larger. Wading through all those numbers is the job of data mining algorithms and not people, and that’s as it should be.

It’s the job of web analysts to sort through the data spat out by programs like Google Analytics and Webalizer, sometimes using algorithms and software tools and sometimes their own judgment.

The task isn’t just to build a giant database, it’s to convert that data into usable information. Walmart doesn’t find out whether coffee and sugar are bought together by reading through its data, and nobody trying to make money from a website needs to wade through masses of facts and figures and sieve out what’s important. That’s our job.

We know what to look for and how to extract information from data, and we also know how to present that information in an easily accessible form, which is half the battle when dealing with large databases.


Musings on metadata

Checking metadata is important in analytics and in modelling.

Years ago when I was working with earth scientists on a regular basis, a hydrologist asked me to take a look at a couple of his time series. A usually reliable rainfall-runoff model was refusing to calibrate properly, and when forced, producing results that were obviously nonsensical.

The problem was easy enough to spot. Of the two series of daily values, peaks were appearing in the runoff (river level) before the rainfall event was recorded. Conceptual models get upset when things like that happen.

A quick look at the metadata revealed the cause. The data wasn’t necessarily poor or even unrepresentative of real world behaviour, but the measurements were not taken in the same place or by the same person. Rainfall was measured every day at 9am, using a day defined from 0900 to 0859. Runoff was automatically logged at midnight every night, 0000 to 2359. You can see what happened from there.

Could we go back and re-collect a decade of data? No. Could we interpolate one series to fudge the definition of a day? Absolutely. Problem solved and we had a happy and usable conceptual model.
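For the curious, one simple way to do that kind of fudge is sketched below. It is not the correction we actually used, just an illustration under two stated assumptions: each rainfall reading is the total for the 24 hours ending at 9am on the day it is logged, and rain falls evenly across those 24 hours.

```python
def to_calendar_days(readings_9am):
    """Re-apportion 9am-to-9am rainfall totals onto midnight-to-midnight days.

    readings_9am[i] is assumed to be the total for the 24 hours ending at
    09:00 on day i. Calendar day i then takes 9/24 of reading i (midnight
    to 09:00) and 15/24 of reading i + 1 (09:00 to midnight). The last day
    is dropped because its afternoon belongs to a reading we do not have.
    """
    early, late = 9 / 24, 15 / 24
    return [
        early * readings_9am[i] + late * readings_9am[i + 1]
        for i in range(len(readings_9am) - 1)
    ]


# Made-up readings in mm: a single 24 mm event logged on day 2.
readings = [0.0, 0.0, 24.0, 0.0]

print(to_calendar_days(readings))  # [0.0, 15.0, 9.0]
```

Most of the rain logged on day 2 ends up on calendar day 1, which is why the river appeared to rise before the rain arrived.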

However, it’s worth noting that almost all small rural rainfall measurements are taken 9am to 9am, and almost all runoff values assume the day begins at midnight. The problem exists in many, many more datasets than we applied the correction algorithm to. All the subsequent modelling done with uncorrected rainfall-runoff data could have been improved in accuracy and utility had more people taken the time to interrogate the metadata and see that all was not well.

More recently I’ve noticed a similar issue crop up in web analytics. Instead of differing definitions of what constitutes a daily period, the trick is in the time zones used by various data collection devices and logs. If you’re seeing an inconsistency between two or more datasets or series, it could be something as simple as that.
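The sketch below shows how the same hit can land on two different days depending on the time zone of the log. The timestamp and the UTC-5 offset are made up for illustration; the fix is to convert everything to one zone (UTC here) before comparing daily totals.

```python
from datetime import datetime, timedelta, timezone


def utc_date(local_naive: datetime, utc_offset_hours: float):
    """Return the UTC calendar date for a timestamp logged at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return local_naive.replace(tzinfo=tz).astimezone(timezone.utc).date()


# A hit logged as 22:30 on 19 May by a server running at UTC-5.
stamp = datetime(2012, 5, 19, 22, 30)

print(utc_date(stamp, -5))  # 2012-05-20
print(utc_date(stamp, 0))   # 2012-05-19
```

If one tool reports in server time and another in UTC, every late-evening visit moves into the next day in one of the two datasets.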

Particularly in analytics, it might not be that simple (and often won’t be), but the first check you make should always be in the metadata. If you’ve got it, check it, and if you haven’t, consistency problems may well be the least of your worries!