What do you do when you need to create a Data function from scratch?
You're being brought on board to create a Data department in a company that never had one before. Where do you start?
Note from Randall: Thanks for reading! I’ve been really pleasantly surprised by the response to this newsletter so far, with almost 60 people signing up in the first few days. This is the first ‘proper’ edition, but you can read my introductory post here. This is a fairly detailed post, but I’ve made a simple summary at the end if you don’t want to read the whole thing.
So you’ve been asked to build a Data department from the ground up. Right now there is nothing, and you are responsible for getting to something.
Awesome!
But … where do you start?
This is a problem that will be encountered by pretty much every modern online business that hits a certain size. It’s no secret that data specialists aren’t usually in the first tranche of hires that a startup makes; it’s extremely rare to be in the first 10 people, and not common at all to be in the first 20. However, given how important data is to building and running a digital business, the moment inevitably arrives where it makes sense to establish a data function. Usually that means hiring one person to start with and then growing the team as the company scales up.
If you are the person who is given this task, it can be a big challenge to figure out what to do and where to start.
How do I know this? Well, I have been that person.
And it wasn’t easy!
[Spoiler alert: this newsletter isn’t going to reveal the exact formula for what to do in this situation, because that’s not possible, since every company is different. What I will try to do, instead, is provide you with some questions that will help you to think through tackling your own specific challenge.]
Why does a data team need to be built?
Look, let’s get real, I’m a data guy, so I’m always going to have a bias towards saying, “sure, you definitely need a data team!”
With that in mind, I should preface this section by saying that not every company needs a data team, and beyond that, not every company needs even a single data person. However, every company deals with data of some kind, whatever it does, from the humblest kiosk to the mightiest multinational corporation, so it’s natural in the lifecycle of many growing companies that there comes a point where they decide to build a data function.
Based on my experience, there are roughly three main scenarios where a company decides to build a centralized data team.
1. There are specific data-related problems that have built up and need to be solved for the sake of the business
2. Company management has a vague feeling that ‘we should do more with data’
3. There are already data people distributed through the company (data analysts, data engineers, etc), but a decision has been made to integrate them into a central team
For the sake of this post, let’s say that scenario 3 is really a topic for another day. The topic of ‘centralized vs distributed data teams’ is a long-running discussion in the data world, and one I have strong opinions about, but a company that is making this decision already has some level of data sophistication, so it’s really outside the scope of this particular post.
Scenario 1: ‘OMG! We need to fix our data problems ASAP’
If you’re the person who is being recruited to create a data function from scratch, it’s very important to be aware that there is a high probability that something has gone wrong, and the company is hoping you can fix it for them.
Why?
Simply because, as mentioned before, all companies work with data to one degree or another, but the data needs of small companies are generally, well, small, and they don’t require the attention of a specialist. Therefore, a typical startup scenario is that the initial setup of data systems is done by, well, whoever is available and wants to have a go. It could be a full-stack developer, it could be the CTO … hell, it could be an intern. The reality for early-stage startups is that a properly thought-out data strategy is usually way, way down the priority list; the priority is to get something out there so that the key metrics can be captured, and then, if the company survives long enough, to fix any problems later …
Later eventually arrives, of course, and corners that were cut in the early days can generate problems that only compound as the company grows.
Therefore, if you are the data person who is being brought on board to fix things, it is crucial that you go into this situation with your eyes wide open; ask tons of questions (don’t be afraid to be annoying!) and try to understand as quickly as you can what the main data-related pain points are and how they impact the company.
What are some typical problems that you might encounter at this very early stage?
Missing data: Crucial information is either not captured or cannot easily be accessed, meaning that senior management and key players don’t have a clear understanding of business performance.
Manual processes: Lots of time is wasted manually copying and pasting information from different sources into Excel/G-Sheets; as the company is growing, this is becoming a bigger and bigger pain point.
Data quality issues: Very early stage companies are usually just trying to get the plane off the ground, hence relatively little thought is given to downstream analytical use cases. This can lead to major challenges with aligning data from different sources, particularly if there are no common standards around things like naming conventions for columns and values within backend tables, or inconsistent frontend tracking setups.
Data fragmentation: Often in early stage companies different teams set up their own tools, each of which has a data component, and the data from these different sources cannot easily be reconciled, which can lead to tremendous misunderstandings. This is the proverbial problem of not having a ‘single source of truth’.
Lack of documentation: Writing documentation is boring, which is often why it’s done in a half-assed way. You might encounter a situation where nothing is written down and no one can explain why certain things were done the way they were. A personal example: a semi-automated process to dump data into a G-Sheet from a third party API, where no one knew anything about it, since the code was written by a long-gone contractor who didn’t write any documentation or even comments in the code. The developer had written a JavaScript script to work with a PHP API, so I had to figure out how to convert it all to Python in order to finish the automation.
Fragile infrastructure: Whatever data infrastructure exists (if any exists at all) might have been hacked together in a rush, and it can and does break frequently.
The most serious scenario of all: The company cannot provide information it has to provide, whether to investors (existing or potential), or to government bodies, such as regulators (if it’s in a regulated industry, such as financial services).
Taken together, it all sounds kind of painful, right?
Who needs such a headache?
The plus side of finding yourself in such a situation is that you have real problems to solve, and you know that company management has a real interest in you succeeding, which means that it will be easier to unlock the resources you need for success, whether those be financial, manpower, or just getting your questions answered.
Scenario 2: A vague feeling that ‘it’s time to do something with data’
If scenario 1 is the ‘hair on fire’ scenario, scenario 2 is much more chill. Basically, this is the scenario where company leadership has the idea that it’s the right time to get more professional with data, but hasn’t necessarily thought things through that deeply, since there isn’t the same urgency as would be encountered in the first scenario. Now, the reality is that once you are in the door, you might discover that actually there are serious problems that need to be fixed, but let’s assume that you won’t know that immediately …
Why might this second scenario arise? Well, everyone knows the cliche about being a ‘data-driven company’, and it’s certainly something that basically all business leaders aspire to these days. As a company grows, the gap between this aspiration and reality can become uncomfortable, leading to the decision to start a data department.
What might this look like in practice? Well, I can give a very good example, since I’ve experienced it myself, when I was brought in to Neugelb to create a data function. The genesis of this was that my boss Holger Gruenwald wanted to ensure that data played a big role in the design and development process for Neugelb’s revamp of Commerzbank’s main mobile banking app (you can read more about the app in German here).
There wasn’t really a fixed project to start with, just some aspirations, so I found myself with more or less a blank page to work from.
This was both exciting and moderately scary!
If scenario 1 has both intense pressure and high-level management buy-in, this scenario features less pressure to produce immediate results, but on the flipside you will need to work harder to get the resources you need. At Neugelb this meant in the early days that it was challenging to get the PMs to prioritize things like setting up custom events for frontend tracking, something that often slipped down the priority list compared to more obviously sexy things like new features. Similarly it was challenging to get analytics added into the QA workflows, and so it was a long process of lobbying and persuasion before it became a standard element.
In my case, if I were to go back in time and start over from scratch, there are some things I would do differently; for example, I spent too long trying to be a ‘data MacGyver’, trying to do everything on my own. I waited too long to start hiring a team, which meant both way more stress for myself than was necessary and a slower velocity of data work; since I was doing everything myself, I did stuff that I wasn’t very skilled at (like data engineering) quite slowly, which affected my ability to deliver in other areas.
Basically, I let my ego get in the way of thinking through what would be the best way of solving the problems facing me.
When you are the first data person hired at a company, you need to do a bit of everything. However, just because you can do a bit of everything doesn’t mean you should, at least in the long term.
The second lesson I can impart from being a data team founder is that you don’t need to leap into action as quickly as you might want to. This can be tricky - who doesn’t want to hit the ground running? However, everyone will benefit from you taking some time to really understand the situation and then prioritizing accordingly.
Trust me on this!
What questions should you ask?
This is the key point; when you are building a data department from the ground up, you will have a lot to do, and until you can hire additional people, you will have only your own time, energy, and sanity to exploit, and you should do so wisely.
Therefore it makes sense to focus first on asking the right questions, getting answers, and then using that to decide where and how to focus. Of course, you most likely won’t have the luxury of avoiding operational tasks, but I strongly suggest using whatever influence you have to get yourself at least some breathing room to investigate and plan. You’ll thank yourself later!
So, if I were to wake up tomorrow and find myself starting a new data function at a new company, what questions would I ask?
These are the core questions I would focus on:
What does the company do? This may seem obvious, but this is the fundamental question to ask. You have to start here - you need to understand the business model. It might be that the company is best known for providing open source software, but if the lights are kept on by the subset of customers who buy the paid service, then, well, it’s not an open source company, it’s a SaaS company that also provides a product to the open source community.
What are the data needs of the business? Another obvious question, but one you need to answer. I would sit down and think through all the different departments and activities, and then go speak to all the relevant stakeholders to understand how they work with data, how they understand the importance of data to their role, and any hopes or aspirations that they might have in relation to data. If you don’t understand the hidden backlog of data tasks that is lurking in people’s minds, then you don’t have the full picture of the challenge before you. And I should be very clear here: don’t just focus on traditionally data-oriented teams like marketing or product, make sure you also speak to finance, operations, people, customer service, etc. Speak to everyone! You might then make a strategic decision to only support certain teams in the short-term (due to limited resources), but it will be a well-informed decision.
What kinds of data does the company generate? Generally speaking, there are four types of data that you will encounter in a tech firm, and you will need to investigate the status of all four types when you arrive:
Production / backend data: This is the data that is stored in the backend and is essential to the proper functioning of the service; for example, at Penta there were backend tables that covered everything from account details to financial transactions to foreign exchange to the ordering of new debit cards. This will always be a key source of data, but also potentially a problematic one, if there are issues with data quality or consistency. Your first task will be to investigate what kind of backend data you have to work with, how you can transfer it to an analytical system (if no such pipeline already exists), and what the different tables and fields mean.
Frontend / behavioral data: This is the data that is collected based on user behavior on frontend services on web or mobile; generally this is collected on the device, but sometimes it is sent from a server instead. Although some companies create their own frontend tracking solutions, most of the time companies rely on an external vendor’s tool to collect this data (there are tons of these tools on the market, such as Google Analytics, Segment, Snowplow, Piwik, Mixpanel, etc). Your first task should be to figure out if the company even has any frontend tracking in place, and then, if it exists, to examine how it’s been implemented, which events are collected, where the data goes, and whether it’s GDPR-compliant.
Third-party data: This is a catch-all term for data that is stored on external services and then imported to your data infrastructure via API (or possibly some other method). The most common types of third-party data would be marketing and sales data from systems like Google Ads, Facebook Ads, Hubspot, Mailchimp, or Salesforce. In a situation where you are setting up a fresh data team, it’s extremely likely that this data is stored only on the third-party systems, so you will need to (1) figure out what info is worth ingesting, and (2) plan how and when to ingest that data into your own systems.
Other internal data: Is there a correct term for this stuff? If there is, I don’t know it. I sometimes call this ‘mushy data’, but in any case what I mean is the universe of spreadsheets that exist within the business that contain highly relevant information about different aspects of business performance on topics like finance, sales and marketing. It can be really beneficial to ingest this data into a centralized data warehouse, but also super painful. This probably won’t be a first order task to deal with, but it’s worth trying to get a handle on what’s there and who maintains it so that you can have it at the back of your mind for future usage.
Does the company have defined KPIs? If so, what are they and how are they generated? Every company will be following some metrics, but it’s your task to understand what they are and how relevant they are to business performance. If the management sees dozens of metrics as being ‘crucial’ to the business, then they haven’t really defined KPIs; you then need to begin the process of clarifying which metrics are the most important for steering the business. Similarly, this is also the moment to drill into how these metrics are calculated, as you want to double-check that they have been accurately counted. I can give a good example of this from earlier in my career, when I was at a VOD company: after joining I quickly found that the way that we counted video plays was a total mess. What had happened was that the devs on each platform (web, mobile and connected TVs) had created their own slightly different definitions of what constituted a video play, making it very challenging to compare performance across platforms. Not ideal when serving video is your main business!
How does data factor into current decision-making? To what extent do KPIs and other metrics influence the way senior leaders and key team leads make decisions? How do people assess the success (or not) of different initiatives? It might be the case that when you do some digging, you will figure out that the whole operation is running on gut instinct, which is not ideal, but it’s better that you know this up front.
What data infrastructure exists already? This is connected to the question about the data that’s generated. Is there any infrastructure in place already? In some cases you might find a rudimentary-to-decent data warehouse that’s been set up by someone (maybe the CTO or a full-stack developer), in others it might just be something like a PostgreSQL server, the free version of Google Analytics and some poorly formatted G-Sheets. Sometimes there might be basically nothing! In any case, you will want to understand where you are starting.
What are the biggest pain points with data? This is a crucial question that will help you to prioritize what has to be done now, and what can wait. Ideally you want to work with your stakeholders to identify projects that are relatively easy to do but will make a big impact in terms of improving everyday working life. It could be automating a process that is currently manually performed, or merging multiple data sets - whatever it is, if you can achieve it, it will give you instant credibility and buy-in for further projects.
What hiring plans are there for data? It is important to establish if you will be able to grow the team, and, if so, how many people you can bring in, and at what budget and what seniority level. Ideally you want to know how long you will be a one-man show, and when you can start to build a proper team.
What are your own strengths and weaknesses? This is one of the most important questions, and it’s especially pertinent when you are working on your own. Let’s get real, there are very few people who are experts in all areas of data; it’s an industry with quite distinct specializations. With that in mind, if you are the only data person in the company, be brutally honest with yourself, because you will only have so much money and time. Therefore, I would suggest spending the money on what you are not a specialist in; in my case, that would be data engineering, so if I was in such a situation again, I would buy off-the-shelf solutions for pipelines and other infrastructure tools as much as possible, and choose things that were simple and robust and wouldn’t require huge amounts of my time to set up and maintain. There would probably be a tradeoff in terms of cost and potential applications, but that’s worth doing in such a situation. I would then spend my time on areas where I was stronger, for example data modeling and data analysis. Think in a clear-eyed way about yourself and budget your time and money accordingly.
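To make the KPI question above a bit more concrete, here is a minimal Python sketch of what a single shared definition of a 'video play' could look like, applied the same way to every platform. This is not how it was done at the VOD company I mentioned; the event fields, the function names and the 3-second threshold are all assumptions I have made up purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical threshold: the real cut-off for a "play" is a business
# decision. The point is only that it lives in exactly one place.
MIN_SECONDS_WATCHED = 3.0


@dataclass(frozen=True)
class PlaybackEvent:
    """Illustrative event shape; real tracking payloads will differ."""
    platform: str           # e.g. "web", "mobile", "tv"
    video_id: str
    seconds_watched: float


def is_video_play(event: PlaybackEvent) -> bool:
    """One shared rule, used for every platform."""
    return event.seconds_watched >= MIN_SECONDS_WATCHED


def count_plays_by_platform(events):
    """Count qualifying plays per platform using the shared rule."""
    counts = {}
    for event in events:
        if is_video_play(event):
            counts[event.platform] = counts.get(event.platform, 0) + 1
    return counts


events = [
    PlaybackEvent("web", "v1", 12.0),
    PlaybackEvent("web", "v2", 1.5),     # too short, not counted
    PlaybackEvent("mobile", "v1", 45.0),
    PlaybackEvent("tv", "v3", 3.0),      # exactly at the threshold, counted
]

print(count_plays_by_platform(events))
# → {'web': 1, 'mobile': 1, 'tv': 1}
```

The design choice that matters here is that the rule sits in one function that every platform's numbers pass through, so the counts stay comparable even if the threshold itself is later changed.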
Thank you for reading - I hope this was helpful (that’s my goal!). Feel free to email me with any questions or comments, or add them in the comments field.
Don’t want to read the whole text? Here’s a quick summary for you
Building a data department from scratch can be a challenging task, especially for businesses that have not previously prioritized hiring data professionals. As data becomes increasingly important for companies of all sizes and industries, there may come a point where it makes sense to establish a data function. There are several reasons a company may decide to build a centralized data team, including specific data-related problems that need to be solved, a desire to do more with data, or the integration of already-existing data professionals into a central team. The process of building a data team will vary depending on the specific needs and goals of the company.
If a company has decided to build a centralized data team, it is likely due to specific data-related problems that need to be addressed. These problems may include missing data, manual processes that can’t scale, data quality issues, data fragmentation, lack of documentation, fragile infrastructure, or the inability to provide required information to investors or regulators. If you’re the data professional who is in charge of building the team, it’s crucial for you to figure out the main data-related pain points and how they impact the company. You should also be prepared to ask many questions, identify the root causes of the problems, and work to solve them in order to improve the company's data infrastructure and capabilities.
Key questions that I recommend asking:
What does the company do?
What are the data needs of the business?
What kinds of data does the company generate?
Does the company have defined KPIs? If so, what are they and how are they generated?
How does data factor into current decision-making?
What data infrastructure exists already?
What are the biggest pain points with data?
What hiring plans are there for data?
What are your own strengths and weaknesses?
One last (musical) thing …
You may or may not be aware of the fact that I’ve been a DJ since 1997, which is … a little while. I’m a huge fan of electronic music, and collecting records in many styles and being a DJ has been my main hobby since I was a teenager. I don’t play in clubs much these days, but I still regularly (very regularly!) make mixes for my SoundCloud and my blog.
Since I love music, I’ve decided to end each newsletter with one of my mixes.
Today, I’m going to share one of my favorite mixes from my back catalogue, Rolled In Sunshine. This was a mix I made four years ago dedicated to what I call ‘rugged jungle soul’:
A little blurb from the blog post to accompany the mix:
What’s nice about this mix, for me at least, is that it points at a direction for jungle music that wasn’t so well explored in the end – jungle as a rugged form of soul music, with sweet melodies and vocals rubbing up against tough breaks – the sounds of rare groove and jazz funk reimagined in a rave context. In the endlessly bifurcating river of breakbeat music, the ‘deeper’ side of this music, as represented in this mix, ended up evolving by 1996 into a very smooth sound, with well-mannered beats matched to tasteful samples from jazz and funk ... I much prefer this tougher sound to what came later, which sometimes sounded a bit like it was designed to soundtrack shampoo commercials.