
data-quality-control-action's Introduction

data-quality-control-action

on: [push]
jobs:
  job:
    runs-on: ubuntu-latest
    steps:
      - name: checkout
        uses: actions/checkout@v3
      - name: data-quality-control-action
        uses: emo-bon/data-quality-control-action@main
        env:
          PAT: ${{ secrets.GITHUB_TOKEN }}
          REPO: ${{ github.repository }}
          ASSIGNEE: <github_username>

where:

  • PAT: a personal access token, or the automatic authentication token (GITHUB_TOKEN)
  • REPO: the repository in which to create an issue to notify the end user
  • ASSIGNEE: the GitHub username of the end user to notify

data-quality-control-action's People

Contributors

bulricht, cedricdcc

Watchers

Marc Portier, Ioulia Santi

data-quality-control-action's Issues

dealing with lists

The rules for identifying and dealing with a list should be turned around.
If logsheets_schema_extended says the field should be a list, then:

  • if there is no ";" in the string, treat it as a one-element list
  • if there are ";"s in the string, treat ";" as the separator
  • strip any trailing blank (in case a ";" is put at the end of the list with nothing else after it)
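
The three rules above can be sketched in Python as follows. This is only an illustration: the function name `parse_list_cell` is hypothetical and not taken from the action's code, and dropping every empty item (not only a trailing one) is an assumption.

```python
def parse_list_cell(value: str, separator: str = ";") -> list[str]:
    """Split a cell that the schema declares to be a list (hypothetical helper)."""
    # A string without the separator splits into exactly one element.
    items = [item.strip() for item in value.split(separator)]
    # Drop empty items, such as the one produced by a trailing ";".
    # Assumption: empty items in the middle of the list are dropped too.
    return [item for item in items if item]


print(parse_list_cell("water"))             # one-element list
print(parse_list_cell("water; sediment;"))  # trailing ";" is ignored
```

With this approach the schema alone decides whether a field is a list; the presence or absence of ";" only determines how many elements it has.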

etl action is not working

We tried to run the "etl action" which you do for individual observatories, but it did not work
Unsure what is going wrong, but it needs to be working asap

At the same time, it may be useful to have more logging of what is going wrong, because there is a good chance it is something to do with the logsheet values not conforming to requirements - the problem is figuring out which data are doing this
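
As a rough idea of what such logging could look like, the sketch below reports the row, column and value of every cell that fails a check. Everything here is hypothetical (the function `find_nonconforming`, the logger name, and the shape of `rows` and `validators`); it does not describe how the action currently works.

```python
import logging

logger = logging.getLogger("logsheet_qc")  # hypothetical logger name


def find_nonconforming(rows, validators):
    """Return (row_number, column, value) for every value that fails its check.

    rows: list of dicts, one per logsheet row (assumed layout).
    validators: dict mapping a column name to a function returning True/False.
    """
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for column, is_valid in validators.items():
            value = row.get(column)
            if not is_valid(value):
                # Log exactly which cell is the problem, so it can be fixed at the source.
                logger.warning(
                    "row %d, column %r: value %r does not conform",
                    row_number, column, value,
                )
                problems.append((row_number, column, value))
    return problems
```

Returning the list of problems as well as logging them would make it possible to put the same details into the issue that notifies the end user.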

Change QC pipeline

Some changes are required to the workflow that harvests - QC - filters - transforms

For the transforming part:

  1. any entries that say "expected [something]" or "Expected [something]" should be transformed to NA
  2. any blanks whatsoever should be transformed into NA (if not done already)
  3. while the orcid is supposed to be an anyURI, if it is a string of the form 0000-etc, then can you prepend https://orcid.org/ to it
  4. transform Y, y, yes, Yes, YES to 1 and NO, No, N, n to 0; and T, True, true to 1 and F, False, false to 0
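
A minimal sketch of these four rules for a single cell is given below. The function `transform_cell` and its `is_boolean` / `is_orcid` flags are hypothetical; the assumption is that the schema tells us which columns are booleans or ORCIDs, so that an "N" in a free-text column is left alone. Returning "1" and "0" as strings is also an assumption.

```python
import re

NA = "NA"
# Exactly the spellings listed in the issue.
TRUE_VALUES = {"Y", "y", "yes", "Yes", "YES", "T", "True", "true"}
FALSE_VALUES = {"NO", "No", "N", "n", "F", "False", "false"}
# Bare ORCID iD: four groups of four characters; the last may be "X".
BARE_ORCID = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def transform_cell(value, is_boolean=False, is_orcid=False):
    """Apply the transformation rules to one cell (hypothetical helper)."""
    text = "" if value is None else str(value).strip()
    # Rules 1 and 2: blanks and "expected ..." placeholders become NA.
    if text == "" or text.startswith(("expected", "Expected")):
        return NA
    # Rule 3: turn a bare ORCID iD into a URI.
    if is_orcid and BARE_ORCID.fullmatch(text):
        return "https://orcid.org/" + text
    # Rule 4: map the yes/no and true/false spellings to 1/0.
    if is_boolean and text in TRUE_VALUES:
        return "1"
    if is_boolean and text in FALSE_VALUES:
        return "0"
    return text
```

Values that match none of the rules are returned unchanged (apart from stripped whitespace), so the QC step can still report them.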

Automatic README when observatory repos are created

I am not entirely sure where to put this issue, so feel free to move it if necessary
All observatories should have the same README, and it can be created automatically when the repo is created (but I am not sure what action does that), and obviously should not be overwritten when the harvesting is done

The text for that should be put somewhere identified, so that when we need to change it, it can be done by the data manager rather than by the developer.

The text we want added now is:
This repository contains the EMO BON logsheets for this observatory, in various formats

  • First they are harvested from the google drive, and put in the logsheets/raw folder
  • They are filtered to contain only the date range that will be quality controlled (as dictated by governance data) and put in the logsheets/filtered folder
  • These are subjected to quality control, and the resulting reports are added to the data-quality-control folder
  • At the same time, some data transformation rules are applied and these cleaned-up logsheets are added to the logsheets/transformed folder
  • The logsheets are then turned into turtle format, following templates and they are added to the sediment and/or water folders, in the subfolders for the three tabs of each logsheet (observatory, measured, and sampling)

Also, a new observatory repo should have the following "About" description added: EMO BON observatory - logsheets

change QC for measurements to deal with ><-

There are values in the measurements tab that are not floats, but include a < value (a maximum, because it means that this is the lowest value that the instrument could record)
And in principle some may also have a > (a minimum) and - (a range), though I do not expect these

These have to be dealt with in the following way

  • for the QC: still report a problem
  • for the transformed logsheets: copy over as-is - this means that ALL measurements have to be recorded as strings in logsheets_schema_extended
  • for the ttl - @laurianvm or @bulricht to change the procedure so that: if there is a "< float" in the cell, put the float as the value in the ttl for that parameter, and indicate that it is a maximum value; if there is a "> float" then put the float as the value and indicate that it is a minimum value; if there is a "float - float" then record it as a range min to max.
  • We will need to document that this is done because it will mean that not all values for measurements are described in the same way and the VRE code that may want to do maths on these data will run into problems. To consider: perhaps make ALL values at least a range, from min to max?
  • for the ENA workflow, as it will not accept strings, we will have to NOT include that value in the xml file BUT we will have to instead create a new entry called "[measurement name] additional info" and copy the string into there. See emo-bon/ena-sample-registration-action#3
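
To make the "< float", "> float" and "float - float" cases concrete, here is a sketch of how a measurement cell could be classified before the ttl is written. The names `Measurement` and `parse_measurement` are hypothetical, and representing a plain float as a range with equal bounds is just one way to follow the "make ALL values a range" idea above.

```python
import re
from typing import NamedTuple, Optional


class Measurement(NamedTuple):
    kind: str              # "exact", "maximum", "minimum" or "range"
    low: Optional[float]   # lower bound, if known
    high: Optional[float]  # upper bound, if known


_NUM = r"[+-]?\d+(?:\.\d+)?"
_LESS = re.compile(rf"<\s*({_NUM})")
_GREATER = re.compile(rf">\s*({_NUM})")
_RANGE = re.compile(rf"({_NUM})\s*-\s*({_NUM})")


def parse_measurement(cell: str) -> Optional[Measurement]:
    """Classify a measurement cell; return None if it matches no known form."""
    text = cell.strip()
    try:
        value = float(text)  # plain floats, including negative numbers
        return Measurement("exact", value, value)
    except ValueError:
        pass
    if m := _LESS.fullmatch(text):
        return Measurement("maximum", None, float(m.group(1)))
    if m := _GREATER.fullmatch(text):
        return Measurement("minimum", float(m.group(1)), None)
    if m := _RANGE.fullmatch(text):
        return Measurement("range", float(m.group(1)), float(m.group(2)))
    return None
```

For the ENA workflow, any result whose kind is not "exact" would be the signal to leave the value out of the xml and copy the original string into the "[measurement name] additional info" entry instead.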

dealing with blanks in the logsheets

For rows that contain blank replicate types, a number of changes are needed

  1. Instead of calling them blank1,2 when repeated, call them just blank but have a _1 and _2 in the source mat id. Need to figure out how to do this - see emo-bon/observatory-hcmr-1-crate#13
  2. the only relevant information from the sampling tab for blanks are the following:
    scientific name --> it's mandatory, so yes. Now it is "unidentified" but I think this is wrong. I added an issue about it.
    tidal stage --> optional, so no. Also, not relevant.
    depth --> mandatory, so yes. For now, I have used 0 (there has to be a number).
    sampl collect device --> optional, so no. Also, not relevant.
    samp mat process and deviations --> optional, but I would say yes.
    membr cut --> optional, but I would say yes.
    size frac low and up --> optional, but I would say yes.
    time fi --> optional, so no. Also, not relevant.
    env_material --> it's mandatory, so yes. Right now, for the sediment blanks it is "swab" for the water blanks it is "Milli-Q water" (based on the SOPs).
  3. no measurements should be added for these samples

We will need a separate function to deal with blanks for the transformed logsheets OR just for the ttl template. To be taken up after my summer break
