Saturday, 19 May 2012

How good is Google Analytics?

Google Analytics isn't perfect but it is very good.

Both very good and not very good at all. Google Analytics is very good at giving simple answers to complex questions, which is great news for anyone without the budget to pay for expensive web analytics software. The bad news is that the answers it gives are far from exact, and you are usually not warned when this is the case.

To start with, Google Analytics relies on snippets of JavaScript code that run in the visitor’s browser. Not everyone browses with JavaScript enabled, and anyone who doesn’t will slip through the net undetected. Estimates for the proportion of users who routinely use the internet without JavaScript are as high as 15%, sometimes a little more. Straight away that’s anything up to 15% of users who won’t be included in statistics generated by Google Analytics. Ouch!

It’s not all bad though. Assuming the proportion of users without JavaScript enabled stays more or less constant, you can still get a lot of value out of the traffic values and trends reported by Google Analytics. Serious problems with the trends have been reported by several analytics professionals, so they shouldn’t be taken as gospel, but in most cases solid information can be obtained.

But then there are the errors. Some fancy site architectures cause huge problems when you try to apply tag-based web analytics solutions. False visits can be generated and sources can get all mixed up. Most of these issues can be fixed with very careful application of the code snippets, but in order to fix something you must first be aware that there is a problem. Watch for a large volume of very short visits or a higher than expected proportion of direct arrivals: these could be a sign that something’s wrong, particularly if you are trying to track Flash activity.
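The two warning signs above can be checked mechanically once visit data has been exported. The following is only a minimal sketch: the `Visit` record, the function name, and all three thresholds are hypothetical and would need tuning against what is normal for your own site.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    duration_seconds: float
    source: str  # hypothetical labels, e.g. "direct", "organic", "referral"


def tracking_warning_signs(visits, short_visit_seconds=1.0,
                           max_short_share=0.3, max_direct_share=0.5):
    """Return warnings when very short or direct visits look over-represented.

    All thresholds are illustrative assumptions, not recommended values.
    """
    if not visits:
        return []
    total = len(visits)
    short_share = sum(v.duration_seconds <= short_visit_seconds for v in visits) / total
    direct_share = sum(v.source == "direct" for v in visits) / total

    warnings = []
    if short_share > max_short_share:
        warnings.append(f"{short_share:.0%} of visits are very short")
    if direct_share > max_direct_share:
        warnings.append(f"{direct_share:.0%} of visits are direct arrivals")
    return warnings
```

An empty list does not prove the tagging is correct; it only means neither of these two symptoms is present.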

Take care when applying segments other than the basic ‘All Visits’. Some calculations can’t be performed in other segments, be they your own custom ones or the standards that come ready-defined, and if you try to make these happen the active segment will revert to ‘All Visits’. Sharp eyes will note an error message but it’s easy to miss.

Of the multitude of calculations and investigations that can be made with Google Analytics, some work really well and some don’t. If you do a lot of comparison with past time periods you will probably see some inconsistencies appear pretty quickly, and that is just one example. The trick lies in knowing what you can rely on and what’s shaky, and what can be helped along with data from other sources.

But let’s take a step back here. We’re talking about a free web analytics tool, not something that costs tens of thousands of pounds a month (these services do exist, and people do pay for them). While Google Analytics does add value to AdWords and other revenue generating bits of the Google family, it is still free to all comers. Is it perfect? No, but web analytics is a very complicated discipline and it can’t be expected to be. Is it useful? Absolutely, if it’s used properly.

Google Analytics is a fabulous tool. It takes a level of detail that was previously only available through a few tricky and/or expensive means and brings it within almost anyone’s reach. But like any complex piece of machinery, it needs to be treated with care and understood intimately if you’re to get accurate and actionable results out of it.


Building an analytics culture in your company

In order to get real value out of web analytics, you and your staff and contractors need to understand exactly what they are looking at, but most of the responsibility of conveying web analytics information smoothly and directly belongs to us, and not to our clients.

“Building an analytics culture in your company”. It sounds like a buzzword (ok, buzzphrase) and to be honest it probably is, but there is something of value in there.

In order to get real value out of web analytics, you and your staff and contractors need to understand exactly what they are looking at, but most of the responsibility of conveying web analytics information smoothly and directly belongs to us, and not to our clients. To this end we use visualisations and graphs, tables, and the written word in various forms. There is a world of difference between data and information. We believe that anyone looking at one of our reports should understand what’s being conveyed without difficulty.

The culture of analytics doesn’t mean your staff need to understand how to get at web analytics data or do a statistics course. They certainly don’t need to feel like an external analytics agency is judging their every move. All they really need is to understand how the information can help them. We like to work with a wide range of staff handling different aspects of a client’s business, and it usually doesn’t take long before different departments start to ask their own questions and engage with the analytics process.


Common sense SEO

SEO is simple. Make your site a good one and the rankings will follow.

SEO isn’t rocket science, or at least, not for small to medium sized websites. Unless you’re aiming to rank for very competitive keywords (in which case rocket science would probably be simpler, quicker, and cheaper), most SEO is pretty simple and you can almost certainly do it yourself if you feel the inclination.

Hiring SEO services is easier and saves time, but it can be expensive. If you prefer, we can offer SEO advice and evaluation instead of a full package. Sometimes not even that is necessary. Much can be done to improve the rankings for small websites with just a little guidance.

As with all search engine optimisation, content is king. Identify the search phrases you’d like to rank for and check where and how often they appear on your site. Aim for somewhere around one occurrence of each per 100 words. Too many and your copy will look spammy and it won’t read well. Too few and search engine crawlers won’t pick up on the keywords and decide your page is relevant. Write for real readers, but keep your keywords in mind, and if you sell it, describe it.
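Checking how often a phrase appears per 100 words is easy to automate. This is a rough sketch: the function name and the simple word-splitting rule are assumptions, and it counts exact phrase matches only (no plurals or word-order variants).

```python
import re


def occurrences_per_100_words(text, phrase):
    """Count exact occurrences of `phrase` in `text`, scaled to 100 words."""
    pattern = r"[a-z0-9'’]+"  # crude tokeniser: letters, digits, apostrophes
    words = re.findall(pattern, text.lower())
    phrase_words = re.findall(pattern, phrase.lower())
    if not words or not phrase_words:
        return 0.0
    n = len(phrase_words)
    hits = sum(words[i:i + n] == phrase_words
               for i in range(len(words) - n + 1))
    return 100.0 * hits / len(words)
```

A result near 1.0 matches the rule of thumb above; a much higher figure suggests the copy may read as spammy.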

Make sure your HTML is tidy, meta tags are populated, and the site looks professional. Ask for opinions from your mates and pay attention to what they say.

Once the site is up to scratch, submit it to some local directories. Google Places (the new name for the business directory linked to Google Maps) is free and inclusion is pretty much guaranteed to all comers.

There is a great deal more to SEO, but to be entirely honest, small local businesses who are looking for publicity rather than full scale eCommerce probably don’t need to know too much about it. Make your site a good one and the rankings will follow.


Measures and metrics

Acknowledging inaccuracy in your stats will actually do you a pretty big favour.

Traffic data is difficult to collect. Tag-based tracking systems like Google Analytics inevitably miss some visitors, and server log analysis isn’t perfectly accurate either. Let’s assume your tracking is JavaScript-based, and perfectly optimised. It will still miss the 10 to 15% of users that routinely browse the wonders of the internet without JavaScript enabled. However, that doesn’t mean you’ve got a total visit count with a 15% error on it.

What you have is a figure that is usually labelled ‘Total Visitors’, but is in fact not that at all. More correctly, it’s a lower bound on total visitors. It’s a measure of total visitor numbers, sure, but it is not the total number of visitors. When the measure goes up, you know visitors have gone up (assuming the fraction of JavaScript-less users remains the same, which is not unreasonable). When the measure goes down, it’s fair to say that traffic has dropped.

The lower bound on total visitors is a very useful thing to know, but it’s also useful to acknowledge that it is not a true and perfect total visitor count. For a start, presenting the real state of affairs to potential investors or advertisers lets you use a bigger best-estimate traffic figure than the one presented by your JavaScript-based tracking system. Your website looks more popular. In fact, it probably is more popular than you think if all you are relying on right now is JavaScript-based tracking like that used by Google Analytics.
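To turn the lower bound into a best-estimate range, one simple approach is to divide the measured count by the fraction of visitors the tag is assumed to catch. In this sketch the function name is hypothetical, and the default 10% and 15% miss rates are taken from the estimate quoted above rather than measured for any particular site.

```python
def estimate_visitor_range(measured, missed_low=0.10, missed_high=0.15):
    """Return (low, high) estimates of true visitors from a tag-based count.

    Assumes measured = true * (1 - missed), where `missed` is the fraction
    of visitors the tracking tag never sees.
    """
    if not 0 <= missed_low <= missed_high < 1:
        raise ValueError("miss rates must satisfy 0 <= low <= high < 1")
    return measured / (1 - missed_low), measured / (1 - missed_high)


low, high = estimate_visitor_range(8500)
print(f"Measured 8500, estimated true total between {low:.0f} and {high:.0f}")
```

The measured figure itself remains the defensible lower bound; the range is only as good as the assumed miss rates.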

We believe you should always try and get an idea of how accurate all your figures are. Knowing that protects you from making poor decisions based on poor data and gives you the confidence to move forward from a fully justifiable position, but in cases like the one discussed above, acknowledging inaccuracy in your stats will actually do you a pretty big favour.


Time scales for SEO and web analytics

Whether your search rankings change for the better or for the worse, wait at least four or five days before celebrating or panicking.

Patience is a virtue, except on the internet. The sheer speed of online communication is astounding when you consider the distances involved, and we’ve all come to expect more or less instant access to data housed anywhere in the world.

However, not everything comes quickly. The benefits of SEO in particular come to those who wait. The time scale for a comprehensive search engine optimisation campaign is rarely less than a month or two for sites in a moderately competitive field. Sites that start out with very little content and few links may see gains fast if they choose keywords conservatively and keep their focus localised, but most companies looking to make serious money using the internet will have to wait a little longer to see strong progress.

That’s often said on various blogs and forums, but what is less commonly mentioned is that the initial ranking gains made during an SEO campaign may not last long. Don’t get too excited if someone gets you into the top spot for your chosen keyword, because that shift in rankings may or may not be robust. Algorithm changes (and Google makes at least one algorithm change per day on average) and action by competing sites can see your place in the results pages drop back down.

Some SEO actions produce more robust gains than others, and the stability of your step up in the rankings will also depend on how competitive your field and your keywords are. SEO is a long term process and it needs to be ongoing, at least to some degree, if you intend to keep any gains you make, but there will probably be short term fluctuations in your rankings from day to day as well.

Whether your rankings change for the better or for the worse, wait at least four or five days before celebrating or panicking.


The difference between data and information

Data is valuable. Nobody holds data in higher esteem than data miners and web analysts, but there is a world of difference between data and information.

Take the famous (to devotees of data mining) and much admired (by devotees of data mining) Walmart customer basket database. The implementation of barcode scanners meant that each collection of products bought by a Walmart customer could be recorded, and they did just that. Back in the early years of the new century this database was probably the biggest in the world. It ran into terabytes back when that was an impressive achievement (even to devotees of data mining). Depending on how you reckon 1 TB, it’s either 1000000000000 or 1099511627776 bytes. The good folks at the University of California, Berkeley, reckon that amount of data at 50000 trees or a good sized forest worth of paper. To put it another way, Dr Jess’ PhD thesis weighed in at 2464KB, or one 405844th of a terabyte.
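For anyone who wants to check the arithmetic, here is a short sketch. It assumes the decimal convention (1 KB = 1000 bytes, 1 TB = 10^12 bytes), which is the one that reproduces the thesis fraction quoted above; the variable names are illustrative.

```python
TB_DECIMAL = 10 ** 12   # 1000000000000 bytes
TB_BINARY = 2 ** 40     # 1099511627776 bytes

thesis_bytes = 2464 * 1000  # 2464 KB, decimal kilobytes assumed

# How many theses of this size fit into one decimal terabyte
theses_per_tb = TB_DECIMAL // thesis_bytes
print(theses_per_tb)  # 405844
```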

Cute arithmetic aside, a TB is a lot of data, and there are plenty of databases of that size around today. Many of the server logs we deal with are well into GB or larger. Wading through all those numbers is the job of data mining algorithms and not people, and that’s as it should be.

It’s the job of web analysts to sort through the data spat out by programs like Google Analytics and Webalizer, sometimes using algorithms and software tools and sometimes their own judgment.

The task isn’t just to build a giant database, it’s to convert that data into usable information. Walmart doesn’t find out whether coffee and sugar are bought together by reading through their data, and nobody trying to make money from a website needs to wade through masses of facts and figures and sieve out what’s important. That’s our job.

We know what to look for and how to extract information from data, and we also know how to present that information in an easily accessible form, which is half the battle when dealing with large databases.


Musings on metadata

Checking metadata is important in analytics and in modelling.

Years ago when I was working with earth scientists on a regular basis, a hydrologist asked me to take a look at a couple of his time series. A usually reliable rainfall-runoff model was refusing to calibrate properly, and when forced, producing results that were obviously nonsensical.

The problem was easy enough to spot. Of the two series of daily values, peaks were appearing in the runoff (river level) before the rainfall event was recorded. Conceptual models get upset when things like that happen.

A quick look at the metadata revealed the cause. The data wasn’t necessarily poor or even unrepresentative of real world behaviour, but the measurements were not taken in the same place or by the same person. Rainfall was measured every day at 9am, using a day defined from 0900 to 0859. Runoff was automatically logged at midnight every night, 0000 to 2359. You can see what happened from there.

Could we go back and re-collect a decade of data? No. Could we interpolate one series to fudge the definition of a day? Absolutely. Problem solved and we had a happy and usable conceptual model.
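One simple way to do that kind of realignment is sketched below. It is not necessarily the correction that was actually applied: it assumes each rainfall value covers the 24 hours starting at 09:00 on its day, and that rain fell evenly through that period, so each midnight-to-midnight day takes 9 hours' worth from the previous reading and 15 hours' worth from its own. The function name is hypothetical.

```python
def realign_to_midnight(rain_9am):
    """Re-bin 09:00-to-09:00 daily rainfall onto midnight-to-midnight days.

    rain_9am[i] is assumed to be the total for the 24 hours starting at
    09:00 on day i. Returns estimates for days 1..n-1 (day 0 is dropped
    because its first nine hours belong to an unrecorded earlier period).
    """
    before_9am = 9 / 24   # share of a calendar day covered by the previous period
    after_9am = 15 / 24   # share covered by the period starting that morning
    return [before_9am * rain_9am[i - 1] + after_9am * rain_9am[i]
            for i in range(1, len(rain_9am))]


print(realign_to_midnight([24.0, 0.0, 48.0]))  # [9.0, 30.0]
```

The uniform-rainfall assumption smears sharp storms across two days, so a fudge like this is better suited to calibrating a daily model than to studying individual events.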

However, it’s worth noting that almost all small rural rainfall measurements are taken 9am to 9am, and almost all runoff values assume the day begins at midnight. The problem exists in many, many more datasets than we applied the correction algorithm to. All the subsequent modelling done with uncorrected rainfall-runoff data could have been improved in accuracy and utility had more people taken the time to interrogate the metadata and see that all was not well.

More recently I’ve noticed a similar issue crop up in web analytics. Instead of differing definitions of what constitutes a daily period, the trick is in the time zones used by various data collection devices and logs. If you’re seeing an inconsistency between two or more datasets or series, it could be something as simple as that.
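A common fix is to convert every timestamp to a single reference zone before grouping by day. The sketch below uses UTC as that reference and only the Python standard library; the function name is an assumption, and it deliberately refuses timestamps with no zone attached, since guessing the zone is exactly the metadata problem described above.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def daily_counts_utc(timestamps):
    """Count hits per UTC calendar day from timezone-aware datetimes."""
    counts = Counter()
    for ts in timestamps:
        if ts.tzinfo is None:
            raise ValueError("naive timestamp: check the metadata for its time zone")
        counts[ts.astimezone(timezone.utc).date()] += 1
    return counts


# The same moment, logged by two systems in different zones
local_log = datetime(2012, 5, 19, 0, 30, tzinfo=timezone(timedelta(hours=1)))
utc_log = datetime(2012, 5, 18, 23, 30, tzinfo=timezone.utc)
print(daily_counts_utc([local_log, utc_log]))
```

Grouped by their own local dates, these two records would land on different days; converted to UTC first, they agree.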

Particularly in analytics, it might not be that simple (and often won’t be), but the first check you make should always be in the metadata. If you’ve got it, check it, and if you haven’t, consistency problems may well be the least of your worries!


The significance of significant figures

Consider the precision of analytics figures carefully. They are not as good as they appear.

At this point in time, the use of significant figures is almost exclusively confined to scientific disciplines. It’s a pity, because these days so many businesses are relying on analytics to provide the basis for some serious decisions and the figures they have are rarely as accurate as they might appear.

The significant figures (or significant digits) from any number are those that contribute meaningfully once uncertainty is taken into account. At DrJess we like to get stuck in and really investigate uncertainty, but the first step is to realise it’s there.

Say Google Analytics says the drjess.com/blog site has seen 1621483 unique viewers in the last month. A touch on the generous side, but in thought experiments we are allowed to be optimistic. Neither the Goog nor almost any other free web analytics suite has much to say on the subject of accuracy, but presenting a number with that many significant figures in any halfway scientific context implies a very high confidence in its accuracy, percentage-wise.

Web analytics pros know that Google Analytics and its tag-based friends tend to undercount traffic by rather more than 10%. More on the reasons why in another post, but let’s be generous and assume a maximum 10% error for the sake of a simple thought experiment. Applying that to my imaginary unique user count gives a true figure of somewhere between 1621483 and 1783631.3.

If I had to pick a single number to represent that range, it would be 1700000, not 1621483. Not only is it probably more accurate, it gives anyone looking at it a much better indication of what the uncertainty is.
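The same rounding can be done in code. This sketch takes the midpoint of the range above and rounds it to two significant figures; the helper name `round_sig` is hypothetical, and two significant figures is a judgment call that suits a 10% uncertainty.

```python
from math import floor, log10


def round_sig(value, sig=2):
    """Round `value` to `sig` significant figures."""
    if value == 0:
        return 0
    digits = sig - 1 - floor(log10(abs(value)))
    return round(value, digits)


reported = 1621483
upper = reported * 1.10          # assumed maximum 10% undercount
midpoint = (reported + upper) / 2

print(round_sig(midpoint))       # 1700000.0
```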

Now, if I was going to base some serious and potentially very expensive business decisions on analytics numbers, I’d want to know just how good my figures were. So would most people who want analytics done, if it occurred to them to wonder if precision might be lacking in the first place. With the over-precise numbers spat out by most analytics programs, the question is rarely posed unless inconsistencies crop up.

I’m not blaming Google. The simple truth is that given the choice between a web analytics program that spits out 1700000 and one that spits out 1621483, the overwhelming majority of users will perceive the latter as more accurate.

Legend states that the first surveyor of Mount Everest measured the height at 29000ft exactly. It was reported as 29002ft, supposedly because the surveyor didn’t think anyone would believe his rounder figure had a reasonable degree of precision.

That was back in 1856. Unfortunately it still seems we’re having this kind of problem. The only solution is to state clearly what the uncertainty is. Bring it out of hiding and discuss error, accuracy, and precision before making decisions.