Using the frictionless framework for data validation
You created your first data package!
In the last episode, I described a general problem I faced during my PhD: not being given enough information about materials, data, and data analysis processes, among other things. I struggled a lot, but the Frictionless Fellows program helped me realize that 1) this problem is very common in research, 2) it is not your fault, and 3) open science tries to offer solutions to it. So I created a subset of my PhD data and then used the Data Package Creator to package and validate the data.
What's next?
Having been a fellow for six months, I feel that my programming skills have been boosted, partly because of the things I have been taught, but also because of my increased confidence in tackling coding problems. I definitely used to have a coding phobia, thinking that every error message I got was my fault!
Apart from increased confidence and better programming skills, I also have more data, so this was an opportunity to enrich my package with the latest data. Interestingly, I have been coding in R for the past few months, so the switch to Python was definitely noticeable! I updated my latest .csv file and made sure I pushed the changes to my repository.
Using the goodtables.io tool, I quickly validated my newly pushed data and got the 'data valid' badge.
My package is valid!
My celebrations didn't last long, however...
Using the Frictionless Framework
Having steeled myself to use Python to validate my data, I turned to the frictionless framework guide. I made sure I followed the installation instructions (and, since I hadn't done anything similar before, I had to make sure that Python was on my PATH). I used Python version 3.7.3 and frictionless version 3.48.0.
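For anyone following along, the framework installs with pip; pinning it to the version I mention above would look like this:

pip install frictionless==3.48.0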
Now, having installed the frictionless framework, I made sure that the working directory was the one containing my data package. There are probably more efficient ways to set this up, but at this point this is what works for me.
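In a Python session, one way to do this is with os.chdir (the path below is just a placeholder for wherever your data package lives):

import os

# Hypothetical path: replace with the folder that contains your data package
os.chdir("path/to/my-datapackage")
print(os.getcwd())  # check that we ended up in the right place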
I asked frictionless to describe the data (the leading "!" is how you run a shell command from a Jupyter notebook or IPython; on the command line itself, drop it):
! frictionless describe correlational_fordatapackage.csv
And I got this:
---
metadata: correlational_fordatapackage.csv
---

compression: 'no'
compressionPath: ''
control:
  newline: ''
dialect: {}
encoding: utf-8-sig
format: csv
hashing: md5
name: correlational_fordatapackage
path: correlational_fordatapackage.csv
profile: tabular-data-resource
query: {}
schema:
  fields:
    - name: participant
      type: integer
    - name: age(1st session)
      type: integer
    - name: PPVT_raw
      type: integer
    - name: PPVT_z
      type: number
    - name: RCPM_raw
      type: integer
    - name: RCPM_z
      type: number
    - name: Srep_structure_correct/32
      type: integer
    - name: morph_prod/24
      type: integer
    - name: morph_comp/31
      type: integer
    - name: syll_manip/12
      type: integer
    - name: phon_manip/8
      type: integer
    - name: pseudoword_rep/48
      type: integer
    - name: fwd_recall/54
      type: integer
    - name: fwd_recall_range/9
      type: integer
    - name: bwd_recall/45
      type: integer
    - name: bwd_recall_range/9
      type: integer
    - name: fwd_corsi/54
      type: string
    - name: fwd_corsi_range/9
      type: string
    - name: bwd_corsi/45
      type: string
    - name: bwd_corsi_range/9
      type: string
    - name: melody_score
      type: integer
    - name: rhythm_score
      type: integer
    - name: cBAT_score
      type: integer
    - name: complex_cBAT
      type: integer
    - name: simple_cBAT
      type: integer
    - name: chat_d'
      type: number
    - name: trackgridprop
      type: number
    - name: trackgridprop_z
      type: number
    - name: trackrecallprop
      type: number
    - name: trackrecallprop_z
      type: number
scheme: file
stats:
  bytes: 3000
  fields: 30
  hash: d5fca8495f9cf143ad4b75b83466f4e9
  rows: 22
Thus, similar to what the Data Package Creator and goodtables.io do, frictionless detects your variables and their names and infers the type of data. However, it detected some of my variables as strings when they are in fact integers. Of course, goodtables did not catch this, as my data were, in terms of formatting, generally valid. Not inferring the right data type can be a problem both for future me and for other people looking at my data.
The cause of the problem was four cells with unavailable data, which I had filled in as "NA". Frictionless "saw" the strings and inferred the whole variables as strings. I had to find a way to tell frictionless to treat those "NA" cells as missing. This line of code solved the issue:
from frictionless import describe

resource = describe("correlational_fordatapackage.csv", infer_missing_values=["", "NA"])
This makes frictionless treat my "NA" cells like empty cells, i.e. as missing values. Thus, when calling the resource variable, I got the right description:
{'name': 'correlational_fordatapackage',
 'profile': 'tabular-data-resource',
 'path': 'correlational_fordatapackage.csv',
 'scheme': 'file',
 'format': 'csv',
 'hashing': 'md5',
 'encoding': 'utf-8-sig',
 'compression': 'no',
 'compressionPath': '',
 'control': {'newline': ''},
 'dialect': {},
 'query': {},
 'schema': {'missingValues': ['', 'NA'],
            'fields': [{'name': 'participant', 'type': 'integer'},
                       {'name': 'age(1st session)', 'type': 'integer'},
                       {'name': 'PPVT_raw', 'type': 'integer'},
                       {'name': 'PPVT_z', 'type': 'number'},
                       {'name': 'RCPM_raw', 'type': 'integer'},
                       {'name': 'RCPM_z', 'type': 'number'},
                       {'name': 'Srep_structure_correct/32', 'type': 'integer'},
                       {'name': 'morph_prod/24', 'type': 'integer'},
                       {'name': 'morph_comp/31', 'type': 'integer'},
                       {'name': 'syll_manip/12', 'type': 'integer'},
                       {'name': 'phon_manip/8', 'type': 'integer'},
                       {'name': 'pseudoword_rep/48', 'type': 'integer'},
                       {'name': 'fwd_recall/54', 'type': 'integer'},
                       {'name': 'fwd_recall_range/9', 'type': 'integer'},
                       {'name': 'bwd_recall/45', 'type': 'integer'},
                       {'name': 'bwd_recall_range/9', 'type': 'integer'},
                       {'name': 'fwd_corsi/54', 'type': 'integer'},
                       {'name': 'fwd_corsi_range/9', 'type': 'integer'},
                       {'name': 'bwd_corsi/45', 'type': 'integer'},
                       {'name': 'bwd_corsi_range/9', 'type': 'integer'},
                       {'name': 'melody_score', 'type': 'integer'},
                       {'name': 'rhythm_score', 'type': 'integer'},
                       {'name': 'cBAT_score', 'type': 'integer'},
                       {'name': 'complex_cBAT', 'type': 'integer'},
                       {'name': 'simple_cBAT', 'type': 'integer'},
                       {'name': "chat_d'", 'type': 'number'},
                       {'name': 'trackgridprop', 'type': 'number'},
                       {'name': 'trackgridprop_z', 'type': 'number'},
                       {'name': 'trackrecallprop', 'type': 'number'},
                       {'name': 'trackrecallprop_z', 'type': 'number'}]},
 'stats': {'hash': 'd5fca8495f9cf143ad4b75b83466f4e9',
           'bytes': 3000,
           'fields': 30,
           'rows': 22}}
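Before moving on, the corrected description can be saved so the inference doesn't have to be repeated. This is only a sketch based on my reading of the frictionless docs, and the output file name is my own choice:

# Write the inferred resource descriptor to a YAML file for reuse
resource.to_yaml("correlational_fordatapackage.resource.yaml")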
Phew, that was definitely hard work! Unfortunately, my work here is not done if I want to update everything. The Data Package Creator created a very neat schema, and we can try to replicate that using Python. So we call describe_schema:
from frictionless import describe_schema

# Infer a schema for the file; note that the "NA" issue from above applies
# here too, so the infer_missing_values option may be needed again
schema = describe_schema("correlational_fordatapackage.csv")

# Fetch individual fields and document them with titles and descriptions
schema.get_field("participant").title = "participant code"
schema.get_field("age(1st session)").title = "age"
schema.get_field("age(1st session)").description = "participants' age (in months) in the first session"
So we retrieved specific fields and added titles and descriptions to them. Hopefully the package now carries more information.
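And since the whole point is validation, the enriched schema can be fed back into a validation run from Python too. A minimal sketch, assuming the validate function accepts a schema option as described in the frictionless docs:

from frictionless import validate

# Validate the CSV against the schema we just enriched;
# report.valid is True when no errors are found
report = validate("correlational_fordatapackage.csv", schema=schema)
print(report.valid)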
Conclusion
I tried to tackle a few issues with data validation using user-friendly options, but also using the command line and Python. It's not easy, and I still have a long way to go before I get my head around all of this, but practicing made me think about schemas and about thinking in different programming languages, and it reinforced what I have been feeling throughout the fellowship: that I can actually do it!