Automating Data Coverage Updates

As an analyst, I would receive monthly Excel sheets from our data vendor describing their data coverage.

Summarising changes in their coverage took time, so I decided to programme the problem away.

The user should load the file, run the programme, and then see a heatmap of our vendor’s data coverage and a list of changes in coverage compared to last month’s file.

In short, the programme had to score our vendor's data coverage in different countries, save the output as a heatmap, compare that heatmap to the previous month's, and log any changes.

Data scoring

After loading the data, we extract the date when the file was last updated by the vendor so that we can organise the Excel sheets and identify any changes in their data.
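A minimal sketch of that step, assuming the vendor stores the date in a known cell (the filename and cell position below are illustrative):

import pandas as pd

# Hypothetical sketch: pull the 'last updated' date from a known cell.
# The filename and the cell position (row 0, column 1) are assumptions.
raw = pd.read_excel('vendor_coverage.xlsx', header=None)
last_updated = pd.to_datetime(raw.iloc[0, 1])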

Now that we’ve extracted both the data and the last updated date, we can score the data and organise insights by month.

Scoring involves analysing the freshness of the data. We average the two numeric cadence scores and map the result back to its categorical label using our freshness_dictionary:

# Average the two cadence scores and round to the nearest whole number
data.loc[:, 'Freshness'] = ((data['Cadence (Incorporation to OC Entry)']
                             + data['Cadence (Updated to OC)']) / 2).round(0)

# Invert freshness_dictionary so numeric scores map back to their categorical labels
data.loc[:, 'Freshness'] = data['Freshness'].map({v: k for k, v in freshness_dictionary.items()})
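For context, freshness_dictionary maps freshness labels to numeric scores. An illustrative, not actual, version:

# Illustrative only; the real labels and scores belong to our scoring scheme
freshness_dictionary = {'Daily': 10, 'Weekly': 8, 'Monthly': 6, 'Quarterly': 4, 'Annually': 2}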

The next step is to assign a score of 10 where firmographic and officer data exist and 0 where no data exists.

# Score 10 where the vendor has firmographic data (legal-name flag set), otherwise 0
data.loc[:, 'Firmographic'] = data['Legal name'].eq(1).astype(int) * 10
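The officer score presumably follows the same pattern; the flag column name here is an assumption:

# Hypothetical analogue for officer data; the actual flag column may be named differently
data.loc[:, 'Officers'] = data['Officer name'].eq(1).astype(int) * 10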

The result is a dataframe with converted freshness labels and scores showing the states and countries where our vendor has firmographic and officer data.

Once scored, we import our file containing a normalised list of the world’s countries and states, perform an outer merge with our dataframe, and fill NA values with 0.

# Normalised list of the world's countries and states
jurisdiction_table = pd.read_csv('scripts/lookup/template_jurisdictions.csv')

# Outer merge keeps every jurisdiction, even where the vendor has no data;
# those rows are filled with 0
oc_coverage = data.merge(jurisdiction_table, how='outer', on='Jurisdiction').fillna(0)

The merged dataframe is, in effect, a heatmap of the countries where our data vendor has coverage.

We save this heatmap in our output directory and append the date when the file was last updated to the name of the file.
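A sketch of that save, assuming last_updated holds the date extracted earlier (the output path and filename pattern are illustrative):

# Illustrative: append the vendor's last-updated date to the output filename
output_path = f"output/oc_coverage_{last_updated.strftime('%Y-%m-%d')}.xlsx"
oc_coverage.to_excel(output_path, index=False)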

Log changes

The second step is to compare the heatmaps and identify any differences between the latest update from our data vendor and last month’s file.

The heatmap files no longer have a cell where the date used to be, so we extract it from the name of the file instead by calling the add_date_columns function.

from dateutil import parser

def add_date_columns(output_file, output_data):
    # Parse the date embedded in the heatmap's filename, ignoring the surrounding text
    filename_date = parser.parse(output_file, fuzzy=True)
    # Stamp every row with the file's month, e.g. 'May'
    output_data['Date'] = filename_date.strftime('%b')
    return output_data

The add_date_columns function takes the date in the heatmap’s filename and uses the date to populate a column in the heatmap. This is useful because we’ll merge the files to create one dataframe and the dates will act as a unique identifier.
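For example, with a heatmap saved under the May update (the filename is illustrative):

# The fuzzy parser pulls '2021-05-01' out of the surrounding filename text
output_data = add_date_columns('oc_coverage_2021-05-01.xlsx', output_data)

Every row in output_data now carries 'May' in its Date column.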

main.py then passes the list of files to our merger function, which essentially selects the relevant columns and performs an outer merge on them.
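A minimal sketch of such a merger, assuming the score columns above; the function name and the use of pd.concat are my illustration, not the original code:

def merge_heatmaps(files):
    # Read each monthly heatmap, stamp it with its filename date,
    # and keep only the columns we compare on
    frames = []
    for output_file in files:
        output_data = pd.read_excel(output_file)
        output_data = add_date_columns(output_file, output_data)
        frames.append(output_data[['Jurisdiction', 'Freshness', 'Firmographic', 'Date']])
    # Stack the months into one long dataframe, keeping every jurisdiction
    return pd.concat(frames, ignore_index=True)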

With the returned dataframe called merged_data, we can call our function change_log in main.py to compare heatmaps and register changes in a log.

The change_log function compares the heatmaps month by month and records the differences in a single log covering all months.

The function starts by creating a pivot table from the merged dataframe, pivoted by date. The result is a multi-indexed dataframe with one set of score columns per month.

We then create variables to compare the months: column 0 always holds the value for the latest month, and column 1 holds the value for the month we want to compare it against.

With these variables, we can compare the months: if the firmographic value for the latest month equals the firmographic value for the month before, True is inserted into the no-change column. The same goes for every other column.
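Putting those steps together, a sketch of the comparison; the internals, including the aggfunc choice, are my assumptions:

def change_log(merged_data):
    # Pivot by date: one column per month for each score
    pivoted = merged_data.pivot_table(index='Jurisdiction', columns='Date',
                                      values=['Freshness', 'Firmographic'],
                                      aggfunc='first')
    changes = pivoted.copy()
    for score in ['Freshness', 'Firmographic']:
        latest = pivoted[score].iloc[:, 0]    # column 0: latest month
        previous = pivoted[score].iloc[:, 1]  # column 1: month to compare against
        # True where the value is unchanged between the two months
        changes[(score, 'no change')] = latest.eq(previous)
    return changes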

Once we have booleans to demarcate where there’s been a change in value between months, we save the output to a master_change_log.
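A one-line sketch of that save (the path is illustrative):

# Overwrite the master log with the change columns for every month so far
changes.to_excel('output/master_change_log.xlsx')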

This file is updated every time we run the programme, and it contains all the months.

Each month, when we get a new file from our vendor and run the script, that month will be represented as a column in this log.

The master_change_log is helpful for viewing change over time, but what if we want a quick update that highlights the latest changes so we can tell the team and key stakeholders?

To achieve this, we create a mask that lets us filter rows where there’s been a change. This narrows our dataframe down to a short table that we save as an Excel file.
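A sketch of that filter, reusing the change columns from the sketch above (the output path is illustrative):

# Select the boolean 'no change' flags and keep rows where any flag is False
no_change_flags = changes.xs('no change', axis=1, level=1)
latest_changes = changes[~no_change_flags.all(axis=1)]
latest_changes.to_excel('output/latest_changes.xlsx')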

From business problem to solution

I said earlier that ‘knowing about updates in our data before that morning coffee with a potential client can be the difference between deal and no deal’.

With this Python programme, I can now update our sales team about changes in our data in less than the time it takes them to finish their coffee.