My data package
Introduction
My research aims at understanding the transmission mechanisms of neglected vector-borne diseases. I mostly deal with data on the distribution and diversity of vectors of diseases and their infection status. The metadata would include but not be limited to the date of sample collection, location and GPS coordinates of the sites of sample collection, type of sample (blood or fly sample), the concentration of RNA or DNA extracted from the samples, and the infection status of the samples (whether the samples are infected with pathogens or not) as well as the blood meal sources of the insect vectors. All these datasets are supposed to be presented in a way that it can be understood by whoever accesses it and that information regarding the licensing and other attribution information can easily be accessed. One way to reduce friction when dealing with such huge datasets is to put them in a container that groups all the descriptive data and schema together. A schema tells us how the data is structured and the type of content that is expected in it. All this is contained in a data package that can be generated by a data package creator.
Packaging my data
I am going to take you through step by step process on how I created a data package for my datset on sandflies diversity, infection status, and their blood-meal sources, using Frictionless Data Package Creator. My data containing all the relevant metadata was compiled in excel and saved it in comma-separated values (csv) format. To create a data package I first loaded the csv file into the Frictionless Data Package Creator. Secondly, I edited the metadata descriptors to include the name as kevinsandfly_data and the title as Sandfly data package. I then selected the profile as Tabular Data Package as well as set the license under the Creative Commons Attribution 4.0 (CC-BY-4.0). Thirdly, I validated my data package to ensure the profile selected was correct for my data package. Finally, I downloaded the validated data package in JSON format.
Figure showing the steps for creating a data package in a data package creator
My data schema
{ "profile": "tabular-data-package", "resources": [ { "name": "resource1", "path": "kevinsandfly_data.csv", "profile": "tabular-data-resource", "schema": { "fields": [ { "name": "Sandfly No.", "type": "string", "format": "default" }, { "name": "Slide No.", "type": "string", "format": "default" }, { "name": "DNA No.", "type": "integer", "format": "default" }, { "name": "Trap NO.", "type": "string", "format": "default" }, { "name": "Collection site", "type": "string", "format": "default" }, { "name": "Location", "type": "string", "format": "default" }, { "name": "Collection date", "type": "integer", "format": "default" }, { "name": "Sandfly sp.", "type": "string", "format": "default" }, { "name": "SEX", "type": "string", "format": "default" }, { "name": "Feeding status", "type": "string", "format": "default" }, { "name": "Leishmmania ", "type": "string", "format": "default" }, { "name": "Latitude", "type": "integer", "format": "default" }, { "name": "Longitude", "type": "integer", "format": "default" }, { "name": "Altitude", "type": "integer", "format": "default" }, { "name": "Outdoor/indoor", "type": "string", "format": "default" } ] } } ], "licenses": [ { "name": "CC-BY-4.0", "title": "Creative Commons Attribution 4.0", "path": "https://creativecommons.org/licenses/by/4.0/" } ], "title": "Sandfly data package", "name": "kevinsandfly_data" }
Conclusion
In summary, I found the Frictionless data package creator to be useful in packaging my dataset in one container with all the metadata and other relevant information. This allows the dataset to be in a format that can be shared and easily understood by others.