Processing Raw Google Analytics data with Elastic Stack - Part 1, Getting raw data.


Posted by Trka in Infrastructure, Metrics and Mining, Mining on Jul 29, 2017

Just Part 1: Alone, this is only slightly magical. This part is about configuring GA and your site-side tracking code to record traffic in a way that can be retrieved 1:1, one row per hit. Parts 2 & 3 aren't done yet... but they will be one day.

Assumptions\Prerequisites: This is a pretty... adventurous topic. You'll need a solid understanding of Google Analytics, JavaScript, and Elastic Stack to go from start to finish.

Possible legal note: To my knowledge, this isn't violating the GA Terms of Service. If it did, that would certainly revoke my GAIQ certificate. Google does offer turnkey solutions for this, so I encourage you to at least consider those, especially if you either a) need more than what this article offers or b) aren't quite comfortable with the implementation here. Like most things I do, this is MIT licensed - should it come to question - you're free to do what you want with the info, but I'm not responsible for the outcome. Due diligence :) 

Google provides "honest" routes and subscriptions that allow you to fetch 1:1 user\hit metrics from Analytics. These come with overhead, be it enterprise-level pricing or extra admin platforms. If, however, all you want from those platforms is 1:1 user\hit tracking, this can be achieved in Google Analytics with a couple of custom dimensions and a few lines of JavaScript.

Set up custom dimensions in Google Analytics.

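You'll need to add two custom dimensions to your GA property. One will be a UUID we generate on the client side, the other will be the UTC timestamp of the pageview. In GA, open the admin panel (path as of the GA version on 2017-07-04): Admin > [Account, of course] > Property > Custom Definitions > Custom Dimensions. Add these two dimensions, in this order. In a clean config GA indexes them 1 and 2 in the order they're created, which is why the tracking code refers to them as dimension1 and dimension2: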

  • browser_id (user-scoped)
  • utc_timestamp (hit-scoped) 

Your Dimensions will look like this when you're done. (This screenshot, and all code samples, etc. to do with GA, assume you have a 'clean' config - no other custom dimensions, etc. If you don't have a clean config, I assume you know how to make the necessary adjustments to your code.)

Now your Analytics console will respect the two additional dimensions, so send them along with your page hit. In the Plunker below, we load in the uuid library, which does a very good job of generating UUIDs for us - otherwise, this is... not a fun thing to generate. We start with the common Google Analytics pageview message, but we bake in our custom dimensions.

dimension2 is easy; we just get the current timestamp. dimension1 - the uuid - is the magic. Since the dimension is user-scoped on analytics' end, we only set it if it's not already there. We use a browser_uuid_set=1 cookie to persist this across hits. When we prepare the pageview, we check this cookie and, if it's not present, we set a uuid for the dimension.
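As a rough illustration of the logic just described (not the original Plunker code), here is a minimal sketch. The helper names hasCookie and buildPageviewFields are hypothetical, the millisecond epoch format for the timestamp and the two-year cookie lifetime are assumptions, and the browser wiring assumes analytics.js is loaded as ga and that the uuid library exposes a global uuid.v4().

```javascript
// Returns true if a cookie with the given name appears in a document.cookie-style string.
function hasCookie(cookieString, name) {
  return cookieString.split(';').some(function (pair) {
    return pair.trim().indexOf(name + '=') === 0;
  });
}

// Builds the extra fields for the pageview hit.
// cookieString: contents of document.cookie; now: a Date; makeUuid: function returning a UUID string.
function buildPageviewFields(cookieString, now, makeUuid) {
  var fields = {
    // utc_timestamp (hit-scoped). Assumption: milliseconds since the Unix epoch, as a string.
    dimension2: String(now.getTime())
  };
  var newCookie = null;

  // browser_id (user-scoped): only send it the first time we see this browser.
  if (!hasCookie(cookieString, 'browser_uuid_set')) {
    fields.dimension1 = makeUuid();
    // Assumption: keep the marker cookie for two years.
    newCookie = 'browser_uuid_set=1; path=/; max-age=' + (60 * 60 * 24 * 365 * 2);
  }
  return { fields: fields, newCookie: newCookie };
}

// Browser wiring: runs only where analytics.js (ga) and the uuid library are present.
if (typeof window !== 'undefined' && typeof ga === 'function' && typeof uuid !== 'undefined') {
  var result = buildPageviewFields(document.cookie, new Date(), function () {
    return uuid.v4();
  });
  if (result.newCookie) {
    document.cookie = result.newCookie;
  }
  ga('send', 'pageview', result.fields);
}
```

Keeping the decision logic in a plain function, separate from document.cookie and ga, makes it easy to check outside a browser. If your config isn't clean, swap dimension1 and dimension2 for whatever indexes GA assigned.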

After putting the new tracking logic in place and making a few clicks around our site, we can visit our Analytics console to check that it's functioning properly. I went to Reports > Behavior > Site Content > All Pages. From the default view, I set the secondary dimension to our custom browser_id. This gives me the following screen:

The hits are a little underwhelming - the site's still in development :) - but we can see my personal traffic is collecting our custom uuid in the browser_id dimension. We'll now be able to track with user\hit granularity. The mechanics of GA's sampling are a non-issue here, as samples - however broad they're configured - will contain at least one row for each combination of dimensions. Our user\hit dimensions will ensure each pageview is sampled - unless, somehow, we hit an edge case where a single user hits the same page twice in one millisecond. 
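To make the "one row per combination of dimensions" reasoning concrete, here is a toy model. It is not GA's actual aggregation or sampling code; toReportRows and the sample hits are made up for illustration. It simply groups hits by whichever dimensions a report asks for.

```javascript
// Toy model only: group hits into report rows keyed by the requested dimensions.
// Each row's value is the number of pageviews that collapsed into it.
function toReportRows(hits, dimensionNames) {
  var rows = {};
  hits.forEach(function (hit) {
    var key = dimensionNames.map(function (name) { return hit[name]; }).join('|');
    rows[key] = (rows[key] || 0) + 1;
  });
  return rows;
}

// Made-up sample data. The last two hits are the edge case:
// same user, same page, same millisecond.
var hits = [
  { page: '/home', browser_id: 'aaa', utc_timestamp: '1500000000001' },
  { page: '/home', browser_id: 'aaa', utc_timestamp: '1500000000002' },
  { page: '/home', browser_id: 'bbb', utc_timestamp: '1500000000002' },
  { page: '/home', browser_id: 'bbb', utc_timestamp: '1500000000002' }
];

var byPage = toReportRows(hits, ['page']);
var byHit = toReportRows(hits, ['page', 'browser_id', 'utc_timestamp']);

console.log(Object.keys(byPage).length); // 1 row: all four hits collapse into '/home'
console.log(Object.keys(byHit).length);  // 3 rows: only the same-millisecond pair collapses
```

With only the page dimension, four hits become one row. With browser_id and utc_timestamp added, every hit gets its own row except the same-millisecond pair.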

Definite legal notes:

  1. On the Analytics end, user-grained tracking is okay. But user-grained tracking with PII is strictly NOT okay. When recording custom data, be sure not to push anything personally-identifiable to Google. 
  2. Cookies: you're using a tracking cookie - albeit a light one. Due diligence: find out if your use-case requires a cookies notice or opt-in and, if it does, include one.

Inside the Analytics Console, we can do fun things with this. Specifically, using the advanced search feature on most reports\views, we can add filters for specific browser_id values to profile user-specific flows and site behavior. But we can do much more with it. Moving forward, we'll use the Analytics API to fetch our granular pageviews and feed them to Elastic Stack for mind-blowing profiles and correlations with other metrics - btw: we're also feeding Facebook, Email events, and who-knows-what-else into our Elastic instance.
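As a preview of what "fetch our granular pageviews" might look like, here is a hedged sketch of a request body for the Analytics Reporting API v4 (reports:batchGet). The later parts of this series may well do it differently. The function name buildHitLevelReportRequest and the view ID are hypothetical, and ga:dimension1 \ ga:dimension2 again assume a clean config. Authentication and the actual HTTP call are left out.

```javascript
// Builds a Reporting API v4 batchGet body asking for one row per
// page + browser_id + utc_timestamp combination.
function buildHitLevelReportRequest(viewId, startDate, endDate) {
  return {
    reportRequests: [{
      viewId: viewId,
      dateRanges: [{ startDate: startDate, endDate: endDate }],
      dimensions: [
        { name: 'ga:pagePath' },
        { name: 'ga:dimension1' }, // browser_id
        { name: 'ga:dimension2' }  // utc_timestamp
      ],
      metrics: [{ expression: 'ga:pageviews' }],
      pageSize: 10000
    }]
  };
}

// '12345678' is a placeholder view ID, not a real one.
var body = buildHitLevelReportRequest('12345678', '7daysAgo', 'today');
console.log(JSON.stringify(body, null, 2));
```

The resulting JSON would be POSTed to https://analyticsreporting.googleapis.com/v4/reports:batchGet with an authorized client.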

[part 1 inspired by: Dayne Batten's Post]