
Goodtables Data Validation

Introduction

During the fellowship programme, I learned that if you want to share your research data with others or archive them for future (re)use, the data should come with contextual information, be clean, and be of high quality. You can generate contextual information for your data with a data package tool (see my previous blog post, where I explain step by step how I created one for my data). Providing context is essential for research reproducibility, but you also want to make sure that your data are clean and error-free. For this purpose, you can use the Goodtables tool.

What is Goodtables?

Goodtables is a free, open-source tool that helps you validate tabular data. It automatically identifies errors in a dataset and reports them so that you can fix them quickly. There are two ways to validate data using Goodtables:

  1. the one-time data validation via a browser tool or command line;
  2. the continuous validation for data hosted on GitHub or other open repositories.

In the following section, I will explain how I used the Goodtables browser tool.

A Goodtables browser tool

The Goodtables browser tool is straightforward and easy to navigate. You provide it with a tabular data source (preferably in CSV format), and it checks your dataset for structural problems such as blank rows and duplicate headers. If you also have a data schema (which you can generate using a data package tool), Goodtables can check your data for content errors as well, such as type or format errors (for the full list of errors identified in the validation process, check this page).
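To make the two structural problems mentioned above more concrete, here is a minimal Python sketch of what checking for blank rows and duplicate headers could look like. It uses only the standard library and is not how Goodtables itself is implemented; the function name structural_errors and the sample data are made up for illustration.

```python
import csv
import io


def structural_errors(csv_text):
    """Return (code, row_number, message) tuples for simple structural problems.

    Illustrative only: this is a hypothetical helper, not part of Goodtables.
    Row numbers are 1-based and count the header as row 1.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    errors = []
    if not rows:
        return errors

    # Duplicate headers: remember the first column in which each name appears.
    first_seen = {}
    for position, name in enumerate(rows[0], start=1):
        if name in first_seen:
            errors.append((
                "duplicate-header",
                1,
                f"header {name!r} in column {position} repeats column {first_seen[name]}",
            ))
        else:
            first_seen[name] = position

    # Blank rows: every cell is empty or contains only whitespace.
    for row_number, row in enumerate(rows[1:], start=2):
        if all(cell.strip() == "" for cell in row):
            errors.append(("blank-row", row_number, "row is empty"))

    return errors


# Made-up sample with a repeated "id" header and one blank row.
sample = "id,name,id\n1,Ana,1\n,,\n2,Ben,2\n"
for error in structural_errors(sample):
    print(error)
```

A real validator covers many more cases, but the idea is the same: each problem is reported with an error code and the place where it occurs.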

To validate my data, I uploaded the dataset and the data schema files to my GitHub repository. From there, I copied the files’ URLs and pasted them into the Source and Schema inputs on try.goodtables.io, respectively. Then, I hit the Validate button, which returned thousands of type and format errors, as you can see in the picture below:

image with datatype errors

Thankfully, Goodtables identifies each error’s specific location in the dataset. So, following the list (highlighted in red in the picture above), I fixed all the errors in my data and ran the validation a second time. This time, the data came back as valid:

image with valid data
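For readers curious about what a type or format error with a specific location means in practice, the sketch below checks each cell against the type declared in a schema and records the row and column of every mismatch. It is a simplified illustration using only the Python standard library, not Goodtables' own code. The schema follows the general "fields" layout of a Table Schema, while the CASTERS table, the type_errors function, and the sample data are assumptions made for this example.

```python
import csv
import io

# Hypothetical mapping from a schema type name to a Python casting function.
# Each function raises ValueError when the text cannot be read as that type.
CASTERS = {"integer": int, "number": float, "string": str}


def type_errors(csv_text, schema):
    """Return (row_number, column_number, field_name, value) for each bad cell.

    Illustrative only. Row numbers are 1-based and count the header as row 1.
    """
    field_types = {f["name"]: f.get("type", "string") for f in schema["fields"]}
    reader = csv.reader(io.StringIO(csv_text))
    headers = next(reader)
    errors = []
    for row_number, row in enumerate(reader, start=2):
        for column_number, (name, cell) in enumerate(zip(headers, row), start=1):
            if cell == "":
                continue  # assumption: empty cells are missing values, not type errors
            try:
                CASTERS[field_types[name]](cell)
            except ValueError:
                errors.append((row_number, column_number, name, cell))
    return errors


# Made-up schema and data: "two" is not an integer and "tall" is not a number.
schema = {"fields": [{"name": "id", "type": "integer"},
                     {"name": "height", "type": "number"}]}
sample = "id,height\n1,1.82\ntwo,1.75\n3,tall\n"

for row_number, column_number, name, value in type_errors(sample, schema):
    print(f"row {row_number}, column {column_number} ({name}): {value!r} has the wrong type")
```

Once the flagged cells are corrected, running the same check again returns an empty list, which mirrors the second, successful validation described above.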

In this blog post, I explained how I used the Goodtables browser tool. I also tried validating my data from the command line and with Python frameworks, just to familiarise myself with them before I start learning and using them in the coming weeks.