Maybe you are also at a start-up building out your own stack from scratch. I hope you’ll read what we’re doing here at DeadHappy and get some ideas or encouragement. Maybe when I look back next year I too will be encouraged. Maybe you’d like to join me at DeadHappy? Read this and you’ll see there is so much value you could add. Maybe you are a salesperson or recruiter who will skim-read this and come up with a relevant pitch… Who knows? But I live in hope.
The Data Community
The data stack at DeadHappy is built upon a community of data people. At the core of the data community are two ‘Data Dudes’, Kanmi and myself (‘Data Dude’ was not my preferred job title, but it’s stuck). Our roles are to do anything that needs doing related to data. Before there were Data Dudes (and for some time after), a good amount of the data stuff at DeadHappy was produced by our C-suite. Nick Meredith pretends to be a CTO but he’s really just our reporting analyst, and Dominic styles himself as Chief Operations Officer for investors but really he’s our Data Analyst.

The Data Dudes are almost completely remote. This is partly due to Covid regulations and partly by choice (we’re both pretty introverted). However, after weeks and months of Zoom calls, I decided to finally determine whether Kanmi had legs… this is photographic evidence that he does.

Samantha Grant – Product manager

John Welbourn-Smith – DevSecOps Engineer

Nick Meredith – CTO

Dominic Holton – Data analyst
The problem is that these C-suite data geniuses primarily work in Google Sheets and Geckoboards, augmented by Dominic’s ‘innovative SQL’ (Kanmi’s term). Our first priority as Data Dudes has been to move away from these sheets towards DeadHappy’s data platform, which I refer to as the Crypt.
Our data community also includes other DeadHappy people who use our data in their decisions, such as our social media sorceress Kate and our product manager (herder of cats) Sam. Along the way we also call in our DevSecOps guy, Big Bad John Welbourn-Smith, for help with the DevOps dark arts.
I think our community works well. One issue, however, is that while Kanmi and I are nominally analytics engineers, we’ve both got a bit of a bias towards engineering. That’s primarily my fault. We have perhaps been too focused on productionizing our data pipeline, at the expense of really spending time analysing the end result. I’d hoped that the rest of the folks at DeadHappy were ready to consume and learn from data if only we could get it to them more reliably. Realistically, DeadHappy needs more data leadership from us to harvest the insights available in what we’ve produced so far, rather than further development of the Crypt. Kanmi and I are developing our skills in machine learning and statistics, but I think the next hires in the marketing and product missions need to be analysts or data scientists. That said, we’ve laid a fairly solid foundation and, of course, there’s going to be a lot more we want to do with the Crypt.
The Crypt
The Crypt is a tool to answer questions for people and for DeadHappy missions. Missions are high-level streams of activity such as growth or creating death products to die for. Missions provide vision for lower-level streams of activity called ‘challenges’. The Crypt will be finished when…
- It is a part of the fabric of all mission-level decision making.
- It runs by itself; it is secure yet available; it’s documented; and it lets us know, and recovers, when something goes wrong
- It automatically proves that it complies with our data policy
- DeadHappy people and partners have the right knowledge, reports, tools and support to answer their own questions
When I joined DeadHappy I made the decision to use Google Cloud Platform, and especially BigQuery, as the core tool for storing and processing data. The DeadHappy tech platform is built on AWS, so for us this meant going multi-cloud, but I think it was the right decision. If you want more info on why we chose GCP over AWS, you can reach out to me. I’m writing more about it, and I’m happy to share my thoughts.
But the upshot of this decision is that BigQuery is the core of our Crypt. To the ‘left’ of the Crypt we have data pipelines and ELT, and to the ‘right’ of the Crypt we have dashboards, spreadsheets, other BigQuery projects and, hopefully in the future, machine learning pipelines and APIs.

A High Level view of our Data Warehouse
Data Ingestion: Overview and ELT
All data lands in a staging project I’ve called the Sincophagus (read: data sink / sarcophagus). We have three basic kinds of data load.
- Snowplow events stream in from user browser sessions.
- Stitch loads data from our various applications, including DeadHappy’s own app as well as Stripe and Typeform.
- Sheets in Google Drive.
If you are wondering why Sheets are part of our ELT: you know those weird business rules that for some reason exist only in someone’s head? I’ve had people put those rules into a Google Sheet, which is then read by the Sincophagus. This is a key move because if someone changes their mind and the rules change, our code doesn’t have to. As much as possible, business logic should be parameters read from sheets editable by the business.
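In case it helps, here’s a rough sketch of how a sheet like that can be wired into BigQuery as an external table. The project, dataset, table and sheet URL below are invented for the example, not our actual names.

```python
# A rough sketch of exposing a Google Sheet to BigQuery as an external table.
# Project, dataset, table and sheet URL are placeholders for illustration.
from google.cloud import bigquery

client = bigquery.Client(project="sincophagus-staging")  # hypothetical project

external_config = bigquery.ExternalConfig("GOOGLE_SHEETS")
external_config.source_uris = [
    "https://docs.google.com/spreadsheets/d/EXAMPLE_SHEET_ID"  # placeholder URL
]
external_config.options.skip_leading_rows = 1  # first row holds the column names
external_config.autodetect = True

table = bigquery.Table("sincophagus-staging.business_rules.commission_rates")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```

Downstream models can then simply select from that table, so when the business edits the sheet the next run picks up the change without any code deploy.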
The DeadHappy application is hosted in AWS and uses an Aurora MySQL database. We scan tables for updates and load them into BigQuery using Stitch every 30 minutes. I’d use an off-the-shelf tool for ELT every time: building manual ETL or ELT processes is time-consuming, and doing this task in-house is unlikely to add anything to your business. Stitch has a good number of free rows a month, so it’s a good choice for a startup like ours.
The real challenge comes from the database itself. For Stitch to be aware of changes, deletes must be ‘soft’ deletes, at least for the period of time it takes Stitch to next scan the table. By a ‘soft’ delete, I mean that the row must be marked for deletion and its updated_at field must be updated. Stitch then ‘sees’ that the row is marked for deletion before it actually disappears. We’ve had issues with people manually deleting rows in the database without setting those crucial deleted_at and updated_at fields. It is easy for rows to simply vanish or change without Stitch passing this new information up to BigQuery.
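For anyone unfamiliar with the pattern, a soft delete is roughly the following. The table, column and connection details here are illustrative rather than our real schema.

```python
# A minimal sketch of the soft-delete pattern against a MySQL database.
# Table, column and connection details are illustrative, not our real schema.
from datetime import datetime, timezone

from sqlalchemy import create_engine, text

engine = create_engine("mysql+pymysql://user:password@aurora-host/deadhappy")


def soft_delete_policy(policy_id: int) -> None:
    """Mark a row as deleted instead of removing it, so Stitch can see the change."""
    now = datetime.now(timezone.utc)
    with engine.begin() as conn:
        conn.execute(
            text(
                "UPDATE policies "
                "SET deleted_at = :now, updated_at = :now "
                "WHERE id = :policy_id"
            ),
            {"now": now, "policy_id": policy_id},
        )
```

A hard DELETE between two Stitch scans removes the row before Stitch ever sees it, which is exactly how rows end up vanishing from the source without the warehouse being told.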
Event Tracking: Snowplow vs Google Analytics
Going forward, however, the database will not be the most important source of information. We’d like to move to an event-driven architecture. Our starting point on that journey is web events. Initially we used Google Analytics to monitor behaviour on the website. However, the following considerations led us to choose Snowplow as our event tracker.
- The Google Analytics data we were extracting into BigQuery is pre-aggregated. This makes it hard to work with because you cannot join to the data at an atomic level; you must first bring everything up to the level of aggregation Google provides before working with the data.
- We wanted more autonomy over how we tag events and what information is included in which events. Google has some provision for this, however…
- Neither Google Analytics nor Google Tag Manager (the container the universal GA tag sits in) is really version-controlled. They have ‘versions’, but this wasn’t working for us. Everything we do is in Git, and for engineers such as ourselves, having the tags in Git is more important than a tool that abstracts away from the code base.
- Finally, we’d like to move to a system where front-end and back-end events are all tracked using the same kind of tool. GA only tracks web events.
We currently have the Snowplow tracker sitting in Google Tag Manager, along with the Google Analytics tag and quite a few other tags used to track ad performance. This is not an ideal state, for a few reasons. We’ve noticed that quite a few events are simply missed; I think this is due to race conditions on certain events between the Snowplow tracker and the other trackers on the website. Basically, we’ve struggled to manage Tag Manager. One solution would be to fix up our Tag Manager implementation. However, for some of the same reasons we opted to move away from Google Analytics, we’ve instead opted for an in-house solution based on Snowplow and AWS, which I’m excited to share this year (once we’ve actually got it working). The basic premise is that we will have one DeadHappy tracker embedded in our application, and it will inform all the other services of what they need to know.
The Snowplow Pipeline
The Snowplow tracker sends event data to a ‘collector’ endpoint sitting in Google Cloud Platform. The collector service sits in Cloud Run and passes valid events to a Pub/Sub topic. We have Snowplow’s FS2 Enricher subscribing to that topic; the FS2 Enricher is Snowplow’s tool for obfuscating PII, adding geo data based on IPs, and so on. A further Snowplow service, the Loader/Mutator/Repeater, also sits in a VM. This loads data into our Sincophagus, changes the schema in BigQuery as necessary, and reloads any data that failed because of a schema change. With the painful exception of the two VMs in this diagram, all of the infrastructure is Terraform code. I’d like to move to 100% infrastructure as code this year.
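To make the flow a little more concrete, here’s a rough sketch of pulling enriched events straight off the Pub/Sub topic for debugging. The project and subscription names are placeholders, not our real setup.

```python
# A rough sketch of inspecting Snowplow enriched events from a Pub/Sub topic.
# The project and subscription names are placeholders for illustration.
from concurrent.futures import TimeoutError as PullTimeout

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
    "crypt-project", "enriched-events-debug"  # hypothetical names
)


def callback(message):
    # Enriched events arrive as tab-separated strings; print the first
    # field (app_id in Snowplow's canonical enriched format) and acknowledge.
    first_field = message.data.decode("utf-8").split("\t")[0]
    print(f"got an enriched event for app_id={first_field}")
    message.ack()


streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)  # listen for 30 seconds, then stop
except PullTimeout:
    streaming_pull.cancel()
```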

Again, any questions? Just ask. I’m happy to share.
The Warehouse
dbt is a powerful tool for compiling data models as select statements. We use it to transform our raw data into reports and models used for data analysis. Basically, our warehouse is a set of tables and views defined by select statements in dbt. Roughly speaking, data goes through staging, filtering/cleaning, enhancing, aggregating and joining phases, and dbt handles each of these phases as a separate select statement. From the command line, dbt interacts with BigQuery, compiling merge or create-table statements as necessary to fill out the data warehouse.
The philosophy driving our data modelling is to produce highly redundant, highly reusable ‘wide tables’. The wide tables are something like Russian dolls, in the sense that each is based on an object at a different level of composition.
For instance, DeadHappy customers can apply for and buy insurance products; accordingly, we have a wide table at an application-product level, a policy_product level, a policy level and a customer level. For web events, we work at the atomic level, a pageview level, a web-session level, a marketing-session level and a customer-pathway level. This means that, in theory, any analysis can be done at any level without the need to perform aggregations or joins.
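As a toy illustration of what that buys us, here’s the sort of single-table query the wide tables make possible. The project, dataset, table and column names are invented for the example.

```python
# A toy illustration of querying a wide table: one SELECT, no joins or group-bys.
# Project, dataset, table and column names are invented for the example.
from google.cloud import bigquery

client = bigquery.Client(project="crypt-project")  # hypothetical project

query = """
    SELECT
      policy_id,
      first_marketing_channel,  -- already denormalised onto the policy row
      product_count,            -- already aggregated up from policy products
      monthly_premium
    FROM `crypt-project.warehouse.wide_policies`
    WHERE policy_start_date >= '2021-01-01'
"""

for row in client.query(query).result():
    print(row.policy_id, row.first_marketing_channel, row.monthly_premium)
```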
Deploying DBT in Production on GCP
We host dbt as a service in Cloud Run. Our service has a few features that I think are worth bragging about. The service is revised each time we push to our repo. All the output from dbt is published to our logs, and dbt’s errors are reported to the Error Reporting service, which in turn sends me a text message and Slacks the data community. The service reads the headers of whatever HTTPS request is sent to it to determine which dbt commands to run. Accordingly, I can run it on a schedule with Cloud Scheduler and have different models running at different frequencies, or even with different variables.
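To give a flavour of the header-driven approach, here is a stripped-down sketch. The header names, the command allow-list and the Flask framing are illustrative; this is not our exact service.

```python
# A stripped-down sketch of a header-driven dbt runner for Cloud Run.
# Header names and the command allow-list are illustrative, not our exact code.
import shlex
import subprocess

from flask import Flask, request

app = Flask(__name__)

ALLOWED_COMMANDS = {"run", "test", "seed", "snapshot"}


@app.route("/", methods=["POST"])
def run_dbt():
    command = request.headers.get("X-Dbt-Command", "run")
    args = request.headers.get("X-Dbt-Args", "")  # e.g. "--select marts.policies"
    if command not in ALLOWED_COMMANDS:
        return f"unsupported dbt command: {command}", 400

    result = subprocess.run(
        ["dbt", command, *shlex.split(args)],
        capture_output=True,
        text=True,
    )
    # Cloud Run forwards stdout/stderr to Cloud Logging; failures logged there
    # can then be surfaced to Error Reporting and alerting.
    print(result.stdout)
    if result.returncode != 0:
        print(result.stderr)
        return "dbt failed", 500
    return "dbt succeeded", 200


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Cloud Scheduler can then hit the service with different headers on different schedules, which is how the same container ends up running different models at different frequencies.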

Our set-up is fairly similar in effect to this one, but we use Terraform to create the infrastructure from code instead of using the console. I highly recommend doing it this way, because you may find (as we did) that the first region you chose in GCP doesn’t support all the features you’d like. Porting the whole service to another region, or to multiple regions, is a pain to do in the console or on the command line because there are so many moving parts.
I’d like to go into more detail about our set-up in a future post, so I won’t say much more on it here. I’m excited by the progress we made in 2021. 2022 is going to be a great year as DeadHappy grows enormously and we adapt and scale our data processes in tandem.