Recently, I redid an old website of mine. I had neglected it for a couple of months. I brought up Google Analytics to see how the site had performed over the previous months, expecting almost zero activity.
To my surprise, the Google Analytics account showed lots of activity.
Just not the activity I wished for.
Event spam, calls to “vote for Trump” hidden in language data, referral spam, ghost referrers: the account had caught every kind of spam one could imagine.
Google Analytics spam is a really nasty thing. It pollutes your reports and dashboards, makes data hard to compare and makes your life as an analyst hard. So let’s see how we can get rid of it.
I structured this post into three parts. If you just want to solve the problem, read part 1 (short answer: there is a great, always up-to-date guide at analyticsedge). If you are working with clients or large accounts, read part 2 as well to get a grip on best practices for handling Google Analytics spam. If you are still here afterwards, read part 3, which is me trying to solve the spam problem more generally (I haven’t come up with a good implementation yet).
Part 1: Removing Google Analytics Spam from your GA Account
I am one of the lucky people who get all three kinds of Google Analytics spam that are currently possible, which are:
- Referral spam (spammy fake referrers e.g. the current wave of latin alphabet “google.com” spam)
- Organic search spam (obviously spam “Keyword” data)
- Event spam (Event categories/actions/labels you didn’t set up containing spam messages)
In order they appear in Google Analytics like this
How do we get rid of those things? The short answer: use the definitive, regularly updated guide at analyticsedge.
The process I followed, as outlined in the article, consists of three parts:
- Create “valid hostname” filters (make sure you have a good list of hostnames ready – you usually have more than one!)
- Create “spam crawler” filters (filters for campaign source – in case you’re wondering why the article is providing 4 different ones at time of writing, it’s because the max length of those filters is 255 characters)
- Turn on Google’s bot & spider option
In the first two steps you might create one filter or several. The article, at time of writing, contains 4 different campaign source filters because the maximum length of a filter’s regular expression is 255 characters. Similarly, you might end up with more than 255 characters’ worth of valid hostnames.
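If your own list of spam sources grows, you can split it into filter patterns by script instead of by hand. Below is a minimal sketch in Python, assuming each term should be matched literally and joined with "|". The function name and the example domains are made up for illustration; only the 255-character limit comes from the text above.

```python
import re

MAX_FILTER_LENGTH = 255  # filter pattern limit mentioned above


def build_filter_patterns(terms, max_length=MAX_FILTER_LENGTH):
    """Pack literal terms into '|'-joined regex patterns that each fit the limit."""
    patterns = []
    current = ""
    for term in terms:
        escaped = re.escape(term)  # dots in domain names must match literally
        if len(escaped) > max_length:
            raise ValueError(f"term too long for a single filter: {term!r}")
        candidate = escaped if not current else current + "|" + escaped
        if len(candidate) <= max_length:
            current = candidate
        else:
            # The current pattern is full: store it and start a new one.
            patterns.append(current)
            current = escaped
    if current:
        patterns.append(current)
    return patterns


# Hypothetical spam sources, for illustration only.
spam_sources = [f"spam{i}.example" for i in range(100)]
for number, pattern in enumerate(build_filter_patterns(spam_sources), start=1):
    print(f"Filter {number}: {len(pattern)} characters")
```

Each resulting pattern would go into its own campaign source exclude filter. Check the patterns in a test view before applying them to a view you rely on.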
This cleans up your data. The second step, to compare the historical data, is to create spam-excluding segments (Remember, in Google Analytics it’s not possible to delete or change any historical data).
Tip: You can’t copy filters in Google Analytics. But you can copy views. So if you happen to create new views, you can create one test view with all your spam filters, then copy it, rename it and apply whatever new filters you need. That way you will also copy the spam filters.
Part 2: Handling Google Analytics Spam Best Practices
Alright, so the article mentioned above will fix your problem. If you check your data from time to time, your account will stay just fine. Using segments you will still be able to produce meaningful reports and evaluations.
This is all fine for small accounts (1-2 properties, and at most a handful of views). But for larger accounts or for client work (i.e. if you’re working in a Google Analytics agency) you need much more. You need a rock-solid way to guarantee the quality of the data appearing everywhere.
There are two reasons why this is so important. In larger accounts multiple people are working together, maybe even an agency and the client. So it’s both harder to spot every kind of spam (because the account is large, not everyone may have the right segments etc.) and harder to communicate to everyone which changes have taken place (e.g. when was the spam removed? Where do I still see polluted data? Which segment is up to date?).
For that reason here are three best practices to handle such projects.
- Check the data quality for spam on a regular basis (common sense, below is a list of places to check)
- Eliminate spam at the earliest stage possible (the earlier you eliminate it, the less you have to clean up down the line)
- Document & share everything at the right place (this is the hardest part really! So make sure you have a rock solid documentation & communication policy – esp. between client & agency)
Best Practice 1: Check the data quality regularly
I’ll outline my take on a proper procedure to keep all Analytics data clean (not just from spam) and accurate in a future post. For now, I recommend checking at least once a month. Use Google Analytics alerts for deviations. You should check at least the following default reports for weird-looking data:
- Hostname report (Audience > Technology > Network – Primary Dimension “Hostname”)
- Event categories (Behavior > Events > Overview)
- Campaign source (Acquisition > All Traffic > Source/Medium)
- The article above for new kinds of spam
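Part of this monthly check can be scripted once you export a report. Here is a minimal sketch in Python, assuming you have the rows of the Hostname report (or the event categories) as (value, sessions) pairs and keep an allowlist of values you expect. The function name and the sample rows are hypothetical.

```python
def find_suspicious_values(rows, allowed):
    """Return (value, sessions) pairs not on the allowlist, largest first."""
    allowed_lower = {value.lower() for value in allowed}
    suspicious = [
        (value, sessions)
        for value, sessions in rows
        if value.strip().lower() not in allowed_lower
    ]
    return sorted(suspicious, key=lambda pair: pair[1], reverse=True)


# Hypothetical rows as they might be exported from the Hostname report.
hostname_rows = [
    ("www.example.com", 1200),
    ("(not set)", 340),
    ("spam-site.example", 75),
]
flagged = find_suspicious_values(hostname_rows, allowed=["www.example.com"])
for hostname, sessions in flagged:
    print(f"{hostname}: {sessions} sessions")
```

Anything the script flags still needs a human look. A new legitimate hostname (a payment provider, a translation service) will show up here as well.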
Best Practice 2: Eliminate spam at the earliest possible stage
A Google Analytics hit, once sent, goes through a couple of stages, namely the collection, the processing & configuration and the reporting stage (see the analytics academy for the details). Before that, the spammer has to visit the website (though not always: there are also spam bots which don’t visit your site but simply iterate over GA IDs).
So there are four stages at which we can try to get rid of spam. Different stages work for different kinds of spam, but the best practice is to try to eliminate the spam as early as possible.
- Server: At the server-level we can implement a .htaccess file to block certain known spam IPs from accessing the website altogether (and thus scraping the Google Analytics code from your website). https://github.com/bluedragonz/bad-bot-blocker/blob/master/.htaccess
- Tag Manager: If you don’t use a tag manager, in my opinion you should. The data is collected in a data layer before it gets sent to Google Analytics, so there must be a way to get rid of spam here. One option (although not perfect) is described at http://www.lunametrics.com/blog/2015/03/19/eliminating-dumb-ghost-referral-traffic/. I’m sure there is a better way to get rid of spam here, but I haven’t seen any implementations yet. For an idea, see part 3.
- Processing & Collection: In the processing and collection step we use the practices outlined above, using bot & spider detection and filters.
- Reporting: In the reporting stage we use segments, dashboards and custom reports to display the data (temporarily changed, not permanently as in the other stages!). This allows us to compare spammed data to new unspammed data. We also can export data to other reporting & dashboarding tools here.
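To illustrate the reporting stage: if you export the Source/Medium data, you can compare spammed and unspammed numbers outside of Google Analytics too. The following is a minimal sketch in Python. The spam pattern, the function name and the sample rows are assumptions for illustration; a real pattern would come from the filter list you actually use.

```python
import re

# Hypothetical spam sources, not a real blocklist.
SPAM_SOURCE_PATTERN = re.compile(r"spam-referrer\.example|free-traffic\.example")


def split_sessions(rows, spam_pattern=SPAM_SOURCE_PATTERN):
    """Sum sessions separately for clean rows and rows whose source looks like spam."""
    totals = {"clean": 0, "spam": 0}
    for source, sessions in rows:
        key = "spam" if spam_pattern.search(source) else "clean"
        totals[key] += sessions
    return totals


# Hypothetical exported rows: (source, sessions).
rows = [("google", 900), ("spam-referrer.example", 250), ("newsletter", 100)]
totals = split_sessions(rows)
spam_share = totals["spam"] / (totals["clean"] + totals["spam"])
print(totals, f"spam share: {spam_share:.0%}")
```

Like a segment, this only changes what you look at. The stored data stays untouched.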
Best Practice 3: Document and share everything with everyone
Why is that so important and who is “everyone”? Really, everyone who looks into the account. They will compare data, they will call you to ask why traffic is increasing, then suddenly decreasing. How do you do that? You implement proper documentation for everything you do in your Google Analytics & GTM account. You share it with everyone. I’ll outline a proper structure in a future blog post.
For now I will point you to Daniel Waisberg’s explanations on his blog (which are also in his new book “Google Analytics Integrations”).
Part 3: Using GTM to Filter Data
It isn’t that hard to filter the data in GTM for valid data. We could for instance check the event categories against the list of valid ones (which are listed in GTM anyway). We would then have to ensure that only the data sent through GTM is actually interpreted in GA. To do so, we would have to implement a custom dimension (or something similar) to filter for by default in GA.
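To make the idea concrete, here is a small model of the logic in Python. It is not GTM or GA code and not a finished implementation. The category list, the dimension slot and the marker value are all hypothetical; the sketch only shows the two checks described above.

```python
# All names below are hypothetical and only model the idea; they are not GTM or GA APIs.
VALID_EVENT_CATEGORIES = {"Downloads", "Outbound Links", "Video"}
MARKER_DIMENSION = "dimension1"  # assumed custom dimension slot
MARKER_VALUE = "sent-via-gtm"    # assumed value the GTM tag would attach


def prepare_event(category, action, label=""):
    """Tag manager side: build a hit only for known categories and attach the marker."""
    if category not in VALID_EVENT_CATEGORIES:
        return None
    return {
        "eventCategory": category,
        "eventAction": action,
        "eventLabel": label,
        MARKER_DIMENSION: MARKER_VALUE,
    }


def passes_view_filter(hit):
    """Analytics side: models an include filter that keeps only hits carrying the marker."""
    return hit.get(MARKER_DIMENSION) == MARKER_VALUE


genuine = prepare_event("Downloads", "click", "whitepaper.pdf")
ghost = {"eventCategory": "some spam message", "eventAction": "spam"}  # sent directly to GA

print(passes_view_filter(genuine))
print(passes_view_filter(ghost))
```

The weak spot is obvious: a spammer who scrapes the site could read the marker value and send it along. So this would raise the bar for ghost spam, not remove the problem.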
Google Analytics spam is a serious problem and has to be dealt with. In particular, if you work on larger accounts or in an agency, in my opinion it’s important to have a proper routine in place. I hope that you now have the tools at hand to set everything up.