How can you use R to clean up messy spreadsheet data for visualization?
Cleaning spreadsheets made fun and easy
In my first Data Rampage newsletter, I wrote about the different types of data that you will typically encounter inside a company:
Production / backend data: This is the data that is stored in the backend server(s) and is essential to the proper functioning of the service.
Frontend / behavioral data: This is the data that is collected based on user behavior on frontend services on web or mobile; generally this is collected on the device, but sometimes it is sent from a server instead.
Third-party data: This is a catch-all term for data that is stored on external services and then imported to your data infrastructure via API (or possibly some other method). The most common types of third-party data would be marketing and sales data from systems like Google Ads, Facebook Ads, Hubspot, Mailchimp, or Salesforce.
Other internal data: Is there a correct term for this stuff? If there is, I don’t know it. I sometimes call this ‘mushy data’, but in any case what I mean is the universe of spreadsheets that exist within the business that contain highly relevant information about different aspects of business performance on topics like finance, sales and marketing.
The fourth type is an interesting one, because it seems to attract relatively little attention in online data discussions; how can you as a data professional take spreadsheet data from other parts of the team and prepare it for analytical use?
In an ideal world, you as a data leader would only deal with immaculately clean data sets, ones that require no grunt work to prepare for analysis and other cool projects.
That’s not the world we live in, though …
Realistically, a lot of crucial data in any given business lives in spreadsheets, and for data teams these can be a treasure trove of helpful information that adds extra richness and context to the data coming from the first three types I discussed above. Unfortunately, much of this data is not ready for analytical use, so if you want to ingest it into a data warehouse, or even just make a visualization with it, you will need to clean it up first.
I know I’ve buried the lede, but this post is about exactly that: how to take a spreadsheet that is messy (well, only somewhat in this case), and clean it and prep it for visualization use.
The spreadsheet in question
A few weeks ago a relative sent me an Excel spreadsheet and asked for some help cleaning it up so that they could chart the values over time and try to identify some trends. I loaded it into R, quickly tidied it up, loaded it into Google Sheets, and made a visualization with trendlines for them. Done!
Afterwards I realized that this might be a good educational example, so with their permission I’m sharing it here, although I’ve renamed the columns with the relevant measurements to ‘Variable 1’, ‘Variable 2’, and ‘Variable 3’.
If you just want the R code, it’s right here on Github.
Let’s have a look at the spreadsheet:
Here’s how it’s structured:
There are four columns: the left-most holds both dates and times, and the other three hold the variable measurements
The dates are on rows of their own, while the variable measurements are aligned with the times they were taken
There are blank rows between each date-time block
Taken together, this is pretty easy for a human to scan through and read, but unfortunately it is not easy to use for any kind of analysis. To build a time series, at a bare minimum you need to reorganize the data so that each measurement is on the same row as the date it was taken on.
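If it helps to see that shape in text form, here’s an illustrative mock-up of the layout (the values are invented and the column names are written as they come out of my tidycols function, which I introduce below; the real data is in the linked sheet):
library(tibble) #also loaded as part of the tidyverse below
#an invented mini-example of the messy layout: a date row, then time rows with
#measurements, then a blank spacer row before the next date
messy_example <- tribble(
  ~time,        ~variable_1, ~variable_2, ~variable_3,
  "01.03.2023", NA,          NA,          NA,
  "08:00",      "120",       "80",        "65",
  "20:30",      "118",       "79",        "70",
  NA,           NA,          NA,          NA,
  "02.03.2023", NA,          NA,          NA,
  "07:45",      "122",       "82",        "68"
)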
Doing this manually means dragging around the dates, deleting the empty rows, and other work … with over 250 rows to work through, this will get old, fast. A better way is to do it via code, and in this post I’m going to walk you through exactly how I did so.
Loading and Cleaning
The original file came to me as an Excel document, and I loaded it into R using the readxl package. For the sake of this newsletter, however, it was easier to load the anonymized data into Google Sheets, which you can access here.
If you want to follow along, you will need to have a Google account and follow these steps:
Install RStudio as your IDE
Use this code snippet to install the necessary packages and follow along:
install.packages(c("tidyverse", "googlesheets4", "lubridate", "snakecase"))
The first step is to load the packages, authenticate your Google account, and load the data frame from Google Sheets; I’ve also included my function tidycols, which reformats column names into a nice snake case format.
library(tidyverse)
library(lubridate)
library(googlesheets4)
#function for cleaning column names
tidycols <- function(df) {
  require(snakecase)
  dfnames <- colnames(df)
  dfnames <- gsub('#', 'nr', dfnames, fixed = TRUE)
  dfnames <- gsub('%', 'pct', dfnames, fixed = TRUE)
  dfnames <- to_snake_case(dfnames, sep_out = "_")
  dfnames <- tolower(gsub(" ", "_", dfnames))
  dfnames <- gsub(".", "_", dfnames, fixed = TRUE)
  dfnames <- gsub("/", "_per_", dfnames, fixed = TRUE)
  colnames(df) <- dfnames
  return(df)
}
#authenticate
gs4_auth()
#read the file in
url <- 'https://docs.google.com/spreadsheets/d/1n1YZSThrLEjqj7kvRD8rKJsPF5NAe_kGhSzmZtYv4RM/edit?usp=sharing'
test <- read_sheet(url, skip = 4, col_types = 'c') %>%
  tidycols()
You will see that in the read_sheet function I’ve added two arguments besides the spreadsheet URL: skip = 4, which skips four rows (and with them the blank space at the top of the spreadsheet), and col_types = 'c', which reads every column in as the ‘character’ data type. I did this because the first column has both dates and times, and I was worried that weird stuff might happen with the automatic type detection. If everything is treated as a string, you can get it in and fix it later!
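Before looking at the data itself, a quick glimpse() is an easy way to confirm that every column really did come in as character:
#every column should show up as <chr> at this point
glimpse(test)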
Let’s see what the data looks like once it’s been read in:
Well, that’s definitely it! But as you can see it’s still not particularly usable for analysis.
Let’s clean it up!
#clean the data
test2 <- test %>%
  #remove missing rows
  filter(!is.na(time)) %>%
  #create a proper date column
  mutate(date = ifelse(is.na(variable_1), time, NA),
         date = dmy(date)) %>%
  #fill missing values downwards
  fill(date, .direction = 'down') %>%
  #remove empty rows
  filter(!is.na(variable_1)) %>%
  #fix variable data type
  mutate_at(vars(starts_with('variable')), as.numeric) %>%
  #create a time stamp
  mutate(time_new = paste(time, ':00', sep = ''),
         timestamp = paste(as.character(date), time_new),
         timestamp = as.POSIXct(timestamp, format = "%Y-%m-%d %H:%M:%S")) %>%
  #tidy up and reorder
  select(-time_new) %>%
  select(date:timestamp, everything())
OK, there are a bunch of steps there, so I’ll go through them one by one:
Remove any rows where there is no value for ‘time’; this gets rid of the spacing rows
Create a new date column by taking the value from ‘time’ if ‘variable_1’ is empty, otherwise leaving it blank
Fill the ‘date’ column downwards; this fills empty space until the next value, so each measurement will now have the correct date
Remove every row with no value for ‘variable_1’; this removes the rows that have a date value in ‘time’ but are otherwise empty
Convert all the variable columns to a numeric data type
Create a timestamp column by pasting the date and time columns together and then converting the result to the timestamp data type
Re-order the columns
What does this look like after this work?
Now this is a big improvement! This is now cleaned and ready for analysis; instead of spending a long time manually deleting rows and copying and pasting date values, everything has been rendered nice and tidy in only a few lines of code.
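If you want a quick sanity check at this point, something like this confirms the row count, that no timestamps failed to parse, and the date range covered:
#optional sanity check on the cleaned data
test2 %>%
  summarize(rows = n(),
            missing_timestamps = sum(is.na(timestamp)),
            first_date = min(date),
            last_date = max(date))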
We can take this even further by pivoting the data into a long format, like this:
test_clean <- test2 %>%
  pivot_longer(cols = c(variable_1, variable_2, variable_3), names_to = 'measurement')
This collapses the individual variable columns into two columns called ‘measurement’ and ‘value’, making the whole dataframe longer and narrower, like this:
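If you want to verify the reshape, a quick count() works: each original row contributes one row per variable, so the three measurements should have identical counts.
#each measurement should appear the same number of times
test_clean %>%
  count(measurement)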
Loading the cleaned data back into Google Sheets
Once the data is cleaned up you can do, well, whatever you want!
But let’s say that you want to load it back into Google Sheets; here’s an easy way to do so:
#write it to a GSheet
gfile_url <- 'YOUR NEW GOOGLE SHEETS URL'
sheet_write(test_clean, ss = gfile_url)
I’ve loaded the data to Google Sheets here, so you can see what it looks like.
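If you don’t have a destination sheet yet, googlesheets4 can also create a brand-new one and write the data in a single step; something like this should work (the sheet name is just an example):
#create a new Google Sheet containing the cleaned data
new_sheet <- gs4_create("data-cleaning-example", sheets = list(cleaned = test_clean))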
Bonus Time! Visualization with ggplot2
I’ve always been a huge fan of the R visualization package ggplot2, which I think is possibly the single best charting tool out there, paid or not.
As a bonus, therefore, here’s a quick visualization of the data, where I’ve created an average value for each date:
As you can see, this is quite a nice, clean looking chart.
Here’s the code I wrote to make it:
test_clean %>%
  #get the average value per day
  group_by(date, measurement) %>%
  summarize(value = mean(value)) %>%
  ungroup() %>%
  #visualize it with date on x axis and value on the y
  ggplot(aes(x = date, y = value, col = measurement)) +
  geom_line() +
  scale_x_date() +
  scale_y_continuous(limits = c(0, 160)) +
  theme_minimal() +
  #add a trendline
  geom_smooth(method = 'lm', se = F, linetype = 2, alpha = 0.35) +
  ggtitle('Data Rampage Data Cleaning Example') +
  #move the legend to the bottom
  theme(legend.position = 'bottom') +
  labs(col = 'Measurement') +
  ylab('Value') +
  xlab('Date')
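If you’d rather save the chart as an image file than view it in RStudio, ggsave() will write out whatever was plotted last; the file name and dimensions here are just examples:
#save the most recent plot to disk
ggsave("data_cleaning_example.png", width = 8, height = 5)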
Housekeeping
Thanks for reading! Next up, either later this week or early next week I’m going to be starting a multi-part series on tracking design, a topic that is really important, but where it can be difficult to find good quality information on the internet. I’m going to look at the fundamental questions you need to ask when creating a tracking plan, and I’m also going to go through a few distinct approaches to tracking design.
Ironically enough, I saw that Timo Dechau of Hipster Data Stack has in fact written about exactly this topic this week, and his post is well worth reading for anyone interested:
My thoughts and emphasis will be somewhat different, but I still highly recommend reading this.
If you missed my last post on marketing attribution using Snowplow Analytics, you can see it here:
I ended up doing some rewriting of it from the original version, but the core information is there. I think it’s an interesting (and helpful) read!
One last (music) thing
Every post ends with a DJ mix from my massive back catalogue (I’ve been a DJ for almost 26 years now). Today I’m sharing Don’t Give Up, a breakbeat mix I did two years ago. This isn’t my usual style, but I’ve really enjoyed the midtempo breakbeat revival of recent years, particularly in the UK, so I put this mix together from some of the records in that style that I’d picked up recently. I am really happy with this one!
That’s it for now, thanks for reading and I’ll be back soon …