emo-bon / data-quality-control-action

Action to run the QC on the EMO BON logsheets and transform them

Dockerfile 0.85% Python 99.15%

data-quality-control-action's Introduction

data-quality-control-action

on: [push]
jobs:
  job:
    runs-on: ubuntu-latest
    steps:
      - name: checkout
        uses: actions/checkout@v3
      - name: data-quality-control-action
        uses: emo-bon/data-quality-control-action@main
        env:
          PAT: ${{ secrets.GITHUB_TOKEN }}
          REPO: ${{ github.repository }}
          ASSIGNEE: <github_username>

The environment variables are:

PAT: a personal access token or the automatic authentication token (GITHUB_TOKEN)
REPO: the repository in which to create an issue to notify the end user
ASSIGNEE: the GitHub username of the end user to notify

data-quality-control-action's Issues

dealing with lists

The rules for identifying and dealing with a list should be turned around:
if logsheets_schema_extended says the value should be a list, then

if there is no ";" in the string, treat it as a 1-element list,
if there are ";"s in the string, treat them as separators,
strip any trailing blank element (in case a ";" is put at the end of the list with nothing after it)
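
The list rules above can be sketched in Python as follows. This is only an illustration of the requested behaviour, not the action's actual code: the function name `parse_list_cell` and its `separator` parameter are hypothetical, and it assumes the caller has already checked in the schema that the column is a list.

```python
def parse_list_cell(value, separator=";"):
    """Split a cell that the schema declares to be a list (hypothetical helper)."""
    # No separator in the string: split() returns a 1-element list.
    parts = [part.strip() for part in value.split(separator)]
    # A separator at the very end leaves an empty last element; drop it.
    while parts and parts[-1] == "":
        parts.pop()
    return parts


single = parse_list_cell("net")
several = parse_list_cell("net; bottle;")
```

Here `single` is a 1-element list and `several` has two elements, because the empty element left by the final ";" is dropped.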

etl action is not working

We tried to run the "etl action", which you run for individual observatories, but it did not work.
It is unclear what is going wrong, but it needs to be working asap.

It would also be useful to have more logging of what goes wrong, because there is a good chance the failure is caused by logsheet values that do not conform to requirements; the problem is figuring out which data are responsible.
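
One way to find out which data are responsible is to log every nonconforming cell with its row and column. The sketch below uses only the standard `logging` module; the function `report_invalid_cells`, the logger name, and the shape of `rows` and `validators` are assumptions made for illustration and do not describe how the action is currently written.

```python
import logging

logger = logging.getLogger("logsheet_qc")  # hypothetical logger name


def report_invalid_cells(rows, validators):
    """Log and return every cell that fails its validator.

    rows: list of dicts, one per logsheet row (assumed layout).
    validators: dict mapping a column name to a function returning True/False.
    """
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for column, is_valid in validators.items():
            value = row.get(column, "")
            if not is_valid(value):
                problems.append((row_number, column, value))
                logger.warning(
                    "row %d, column %r: unexpected value %r",
                    row_number, column, value,
                )
    return problems


example_rows = [{"depth": "5"}, {"depth": "deep"}]
found = report_invalid_cells(example_rows, {"depth": str.isdigit})
```

Returning the list of problems as well as logging them would let the same information go into the issue that notifies the end user.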

Change QC pipeline

Some changes are required to the workflow that harvests - QC - filters - transforms

For the transforming part:

any entries that say "expected [something]" or "Expected [something]" should be transformed to NA
any blanks whatsoever should be transformed into NA (if not done already)
the orcid is supposed to be an anyURI, so if it is a string of the form 0000-etc, prepend https://orcid.org/ to it
transform Y, y, yes, Yes, YES to 1 and NO, No, N, n to 0; and T, True, true to 1 and F, False, false to 0
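
The four transformation rules above could look roughly like this in Python. It is a sketch under assumptions: the function `transform_cell` and its `kind` argument (the column type the caller looked up in the schema) are hypothetical, and the accepted boolean spellings are exactly the ones listed in this issue, nothing more.

```python
import re

NA = "NA"
# Spellings taken literally from the issue text.
TRUE_VALUES = {"Y", "y", "yes", "Yes", "YES", "T", "True", "true"}
FALSE_VALUES = {"NO", "No", "N", "n", "F", "False", "false"}
# A bare ORCID iD: four groups of four, the last character may be X.
BARE_ORCID = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def transform_cell(value, kind="string"):
    """Apply the transformation rules to one cell (hypothetical helper)."""
    text = "" if value is None else str(value).strip()
    # Blanks and "expected ..." placeholders become NA.
    if text == "" or text.startswith(("expected", "Expected")):
        return NA
    # A bare ORCID iD is turned into a URI; full URIs pass through unchanged.
    if kind == "orcid" and BARE_ORCID.fullmatch(text):
        return "https://orcid.org/" + text
    if kind == "boolean":
        if text in TRUE_VALUES:
            return 1
        if text in FALSE_VALUES:
            return 0
    return text
```

Values that match none of the rules are returned unchanged, so the QC step can still report them.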

add QC and transformation and ttl rules for ARMS data

Work to do in the autumn, to allow the ARMS logsheets to be incorporated

Automatic README when observatory repos are created

I am not entirely sure where to put this issue, so feel free to move it if necessary
All observatories should have the same README, and it can be created automatically when the repo is created (but I am not sure what action does that), and obviously should not be overwritten when the harvesting is done

The text for that should be put somewhere identified, so that when we need to change it, it can be done by the data manager rather than by the developer.

The text we want added now is:
This repository contains the EMO BON logsheets for this observatory, in various formats

First they are harvested from the Google Drive, and put in the logsheets/raw folder
They are filtered to contain only the date range that will be quality controlled (as dictated by governance data) and put in the logsheets/filtered folder
These are subjected to a quality control, of which reports are added to the data-quality-control folder
At the same time, some data transformation rules are applied and these cleaned-up logsheets are added to the logsheets/transformed folder
The logsheets are then turned into turtle format, following templates and they are added to the sediment and/or water folders, in the subfolders for the three tabs of each logsheet (observatory, measured, and sampling)

Also, a new observatory repo should have the following "About" description added: EMO BON observatory - logsheets

change QC for measurements to deal with ><-

There are values in the measurements tab that are not floats, but include a < value (a maximum, because it means that this is the lowest value that the instrument could record)
In principle some may also have a > (a minimum) or a - (a range), though I do not expect these

These have to be dealt with in the following way

for the QC: still report a problem
for the transformed logsheets: copy over as-is; this means that ALL measurements have to be recorded as strings in logsheets_schema_extended
for the ttl: @laurianvm or @bulricht to change the procedure so that: if there is a "< float" in the cell, put the float as the value in the ttl for that parameter, and indicate that it is a maximum value; if there is a "> float", put the float as the value and indicate that it is a minimum value; if there is a "float - float", record it as a range from min to max.
We will need to document that this is done, because it means that not all measurement values are described in the same way, and VRE code that wants to do maths on these data will run into problems. To consider: perhaps make ALL values at least a range, from min to max?
for the ENA workflow: as it will not accept strings, we will have to NOT include that value in the xml file BUT instead create a new entry called "[measurement name] additional info" and copy the string into there. See emo-bon/ena-sample-registration-action#3
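
To make the "<", ">" and "-" cases concrete, here is a minimal parsing sketch in Python. It follows the "make ALL values a range" idea floated above, which is only a suggestion in this issue, not a decision. The names `Bound` and `parse_measurement` are hypothetical, and the pattern only handles plain decimal numbers (no scientific notation).

```python
import re
from typing import NamedTuple, Optional


class Bound(NamedTuple):
    """A measurement expressed as a range; None means 'no bound given'."""
    minimum: Optional[float]
    maximum: Optional[float]


NUMBER = r"[+-]?\d+(?:\.\d+)?"
PLAIN = re.compile(NUMBER)
LESS = re.compile(rf"<\s*({NUMBER})")
GREATER = re.compile(rf">\s*({NUMBER})")
RANGE = re.compile(rf"({NUMBER})\s*-\s*({NUMBER})")


def parse_measurement(text):
    """Return a Bound for a measurement cell, or None if it cannot be parsed."""
    text = text.strip()
    if PLAIN.fullmatch(text):
        value = float(text)
        return Bound(value, value)  # a plain float is a range of width zero
    if match := LESS.fullmatch(text):
        return Bound(None, float(match.group(1)))  # "< x": x is a maximum
    if match := GREATER.fullmatch(text):
        return Bound(float(match.group(1)), None)  # "> x": x is a minimum
    if match := RANGE.fullmatch(text):
        return Bound(float(match.group(1)), float(match.group(2)))
    return None
```

A `None` result marks a cell that fits none of the expected forms, which the QC would report as a problem in any case.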

dealing with blanks in the logsheets

For rows that contain blank replicate types, a number of changes are needed

Instead of calling them blank1,2 when repeated, call them just blank but have a _1 and _2 in the source mat id. Need to figure out how to do this - see emo-bon/observatory-hcmr-1-crate#13
the only relevant fields from the sampling tab for blanks are the following:
scientific name --> it's mandatory, so yes. Now it is "unidentified" but I think this is wrong. I added an issue about it.
tidal stage --> optional, so no. Also, not relevant.
depth --> mandatory, so yes. For now, I have used 0 (there has to be a number).
sampl collect device --> optional, so no. Also, not relevant.
samp mat process and deviations --> optional, but I would say yes.
membr cut --> optional, but I would say yes.
size frac low and up --> optional, but I would say yes.
time fi --> optional, so no. Also, not relevant.
env_material --> it's mandatory, so yes. Right now, for the sediment blanks it is "swab" for the water blanks it is "Milli-Q water" (based on the SOPs).
no measurements should be added for these samples

We will need a separate function to deal with blanks for the transformed logsheets OR just for the ttl template. To be taken up after my summer break
