nationalparkservice / DPchecker

DPchecker (Data Package checker) is an R package with a series of functions for NPS data package authors and reviewers to check for internal consistency between data and metadata and for conformance with the data package standards.

Home Page: https://nationalparkservice.github.io/DPchecker/

License: Other

R 100.00%
datastore ecological-meta-data eml national-park-service nps r schema

dpchecker's People

Contributors

roblbaker, wright13


Forkers

roblbaker

dpchecker's Issues

check metadata for storage type

eml/dataset/dataTable/attributeList/attribute/storageType

-Error if empty/missing
-Do anything else with this? More tests?
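A minimal sketch of such a check, assuming xml2 is available and the element path above; the function name is illustrative, not DPchecker's actual implementation:

```r
# Sketch: error if any attribute has an empty or missing storageType (assumes xml2)
library(xml2)

check_storage_type <- function(metadata_path) {
  doc <- read_xml(metadata_path)
  attrs <- xml_find_all(doc, ".//dataTable/attributeList/attribute")
  bad <- vapply(seq_along(attrs), function(i) {
    st <- xml_find_first(attrs[[i]], "./storageType")
    inherits(st, "xml_missing") || !nzchar(trimws(xml_text(st)))
  }, logical(1))
  if (any(bad)) {
    stop(sum(bad), " attribute(s) have an empty or missing storageType")
  }
  invisible(TRUE)
}
```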

Consider adding Rmd template

For running and reporting congruence test results. Could be useful, or maybe it's better to just make a vignette instead (esp. if we're able to use pkgdown sites in GitHub Enterprise).

Sarah comments

  • Can probably use @inheritParams to avoid writing redundant param descriptions
  • Recommend using here::here() instead of getwd(), and avoid using setwd() (see example in pull request). It adds another package dependency, but this is a good package for everyone to have and use. NOTE: here::here() should be fine as a default function argument, but its documentation says it's intended for interactive use, so we shouldn't call it inside package functions.
  • Functions should probably return TRUE/FALSE to indicate pass/fail, or do it testthat style and return the error description for failures and nothing for passes.
  • In load_data, I don't think assign() is necessary
  • Consider adding a single function that runs all congruence tests, and/or a simple Rmd congruence report template
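To illustrate the here::here() and assign() points together, a sketch of what a loader could look like (names and behavior are illustrative, not DPchecker's actual implementation):

```r
# Sketch: here::here() as a default argument (evaluated lazily at call time,
# so it runs in the caller's interactive session, not inside the package).
load_data <- function(directory = here::here()) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  # No assign() needed: just return a named list of data frames.
  setNames(lapply(files, readr::read_csv), basename(files))
}
```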

How should we handle read_csv errors in load_data?

read_csv can throw warnings/errors when csv files aren't in the right encoding and/or when it guesses column types wrong.

Encoding

  • I believe we're requiring/strongly encouraging UTF-8, is that correct?
  • Is this something we can check for?

Column types

  • readr guesses each column's type from the first 1,000 rows by default (the guess_max argument).
  • The brute-force fix is to make it look at all of the rows, but I'm not sure how much this slows things down.
  • We can also set column types based on the metadata, but that breaks if the files and attributes don't match the data
    • If we go this route, the order of the checks matters
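Both routes (and the encoding question) map onto readr arguments; a sketch, where csv_path is a placeholder and attr_types is a hypothetical named vector derived from the EML attributeList:

```r
# Encoding: readr can guess; flag files whose best guess isn't UTF-8/ASCII.
enc <- readr::guess_encoding(csv_path)

# 1. Brute force: guess column types from every row (slower on large files).
dat <- readr::read_csv(csv_path, guess_max = Inf)

# 2. Set column types from the metadata instead of guessing. `attr_types` is
#    illustrative, e.g. c(siteID = "c", count = "d", visitDate = "D"); note
#    this breaks if the attribute order doesn't match the file's columns.
dat <- readr::read_csv(csv_path, col_types = paste(attr_types, collapse = ""))
```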

test missing data codes/definitions

Write a function that:

  1. checks for missing data codes
  2. checks for missing data code definitions

Does/should this function check for whether these codes are used in the data? I see no harm in having a missing code of say "NA" that does not occur in the data if no data are missing. There isn't really a way to determine whether missing data codes not listed in the metadata are used in the data (with the exception of non-numeric missing data codes in numeric data columns, which we already test for).

Additional caveat/hurdle: how to deal with blank cells as missing data? I think R will just read these as NA, with is.na() returning TRUE, whereas if someone types NA into a column R may interpret it as the string "NA", with is.na("NA") being FALSE. If I recall correctly, there's also something funny about whether NAs appear in the first several rows and how that influences how R interprets the column.
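A sketch of the two checks in the numbered list above, assuming xml2 and the standard EML missingValueCode structure (function name illustrative):

```r
# Sketch: each missingValueCode must have both a <code> and a non-empty
# <codeExplanation> (definition).
library(xml2)

test_missing_data_codes <- function(metadata_path) {
  doc <- read_xml(metadata_path)
  mvcs <- xml_find_all(doc, ".//attribute/missingValueCode")
  bad <- vapply(seq_along(mvcs), function(i) {
    code <- xml_text(xml_find_first(mvcs[[i]], "./code"))
    def  <- xml_text(xml_find_first(mvcs[[i]], "./codeExplanation"))
    is.na(code) || !nzchar(trimws(code)) || is.na(def) || !nzchar(trimws(def))
  }, logical(1))
  if (any(bad)) {
    stop(sum(bad), " missing data code(s) lack a code or a definition")
  }
  invisible(TRUE)
}
```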

check metadata for publisher location

eml/dataset/publisher/address/city; eml/dataset/publisher/address/administrativeArea

"administrativeArea" should hold state. Do we also want the city?

-Error if empty
-Warn if not Fort Collins, CO.
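A sketch of the error/warn behavior above, assuming xml2 and the EML paths from this issue (function name illustrative):

```r
# Sketch: error if publisher city/state are missing; warn if not Fort Collins, CO.
library(xml2)

test_publisher_location <- function(metadata_path) {
  doc <- read_xml(metadata_path)
  city  <- xml_text(xml_find_first(doc, ".//dataset/publisher/address/city"))
  state <- xml_text(xml_find_first(doc,
             ".//dataset/publisher/address/administrativeArea"))
  if (is.na(city) || is.na(state)) {
    stop("Publisher city and/or state are missing from the metadata")
  }
  if (city != "Fort Collins" || state != "CO") {
    warning("Publisher location is ", city, ", ", state,
            ", not Fort Collins, CO")
  }
  invisible(TRUE)
}
```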

Update test_date_range to handle flagged data

Consider updating test_date_range to only consider good data and ignore any dates flagged as rejected. Realistically, flags may vary across networks, so maybe provide arguments that let the user specify the name of the flag column and the codes that indicate rejected data
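A sketch of the proposed interface, where the argument names (flag_col, rejected_codes) and default codes are illustrative, not an existing API:

```r
# Sketch: let the caller name the flag column and the codes meaning "rejected";
# by default, behave as before (no filtering).
test_date_range <- function(data, metadata,
                            flag_col = NULL,
                            rejected_codes = c("R", "Rejected")) {
  if (!is.null(flag_col)) {
    data <- data[!data[[flag_col]] %in% rejected_codes, ]
  }
  # ... existing date-range comparison against the metadata runs on `data` ...
}
```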

Test for pubdate

pubdate (eml/dataset/pubDate)

-Check: ISO formatting
-Check: is the year reasonable (say 2023 or later)
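A sketch of both checks, assuming arcticdatautils::eml_get_simple() as used elsewhere in these issues, and assuming full YYYY-MM-DD dates (EML also allows a bare year, which this regex would reject):

```r
# Sketch: error if pubDate is missing or not ISO YYYY-MM-DD; warn on an
# implausibly early year.
test_pub_date <- function(metadata) {
  pubdate <- arcticdatautils::eml_get_simple(metadata, "pubDate")
  if (length(pubdate) == 0 || !grepl("^\\d{4}-\\d{2}-\\d{2}$", pubdate)) {
    stop("pubDate is missing or not in ISO YYYY-MM-DD format")
  }
  year <- as.integer(substr(pubdate, 1, 4))
  if (year < 2023) {
    warning("pubDate year (", year, ") looks unreasonably early")
  }
  invisible(TRUE)
}
```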

Error in lapply

Ran run_congruence_checks() on PACN Streams Data Package and I get an lapply error.

── Checking metadata compliance ──
✔ Your metadata is schema valid.
✔ Each data file name is used exactly once in the metadata file.
✔ Your EML version is supported.
✔ Metadata indicates that each data file contains a field delimiter that is a single character
✔ Metadata indicates that each data file contains exactly one header row.
✔ Metadata indicates data files do not have footers.
✔ Metadata contains taxonomic coverage element.
✔ Metadata contains geographic coverage element
✔ Metadata contains a digital object identifier.
✔ Metadata contains publisher element.
Error in lapply(text, glue_cmd, .envir = .envir) : object 'e' not found
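The "object 'e' not found" from lapply(text, glue_cmd, ...) is characteristic of glue (which cli uses for message templates) interpolating a variable that isn't in scope on that code path. A guess at the failure mode, reproduced in miniature (the variable name e matches the error, but the actual template in the package may differ):

```r
# Likely failure mode (illustrative): a glue/cli message template references
# a variable that was never assigned on the path that built the message.
library(glue)
glue("Could not check publisher location: {e}")
# errors with: object 'e' not found
```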

Write unit tests

Probably easiest to put together a small dummy dataset but I'll need a little help making sure I'm including the right things, esp. when creating the metadata.

test_date_range throws error

line 586: firstDate <- arcticdatautils::eml_get_simple(metadata)
Needs a second argument, the element name: arcticdatautils::eml_get_simple(metadata, "beginDate")

test_numeric_fields produces warnings

test_numeric_fields(directory=here::here("SET"), metadata)
✔ Columns indicated as numeric in metadata contain only numeric values and valid missing
value codes.
Warning message:
One or more parsing issues, call problems() on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)

How to determine EML version?

Looking at a couple of metadata files, it looks like the version number appears in a few places at the start of the xml doc (example below). Do we want to check all of those for consistent/correct EML version, or do we just want to look at one?

<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.2" packageId="BUIS_herps_metadata" xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd" system="unknown">
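One option is to read the version from every namespace URI on the root element and error if they disagree; a sketch assuming xml2 (function name illustrative):

```r
# Sketch: extract the EML version from the root namespace declarations,
# e.g. "https://eml.ecoinformatics.org/eml-2.2.0" -> "2.2.0".
library(xml2)

get_eml_version <- function(metadata_path) {
  doc <- read_xml(metadata_path)
  ns <- as.character(xml_ns(doc))
  eml_uris <- ns[grepl("eml.ecoinformatics.org", ns, fixed = TRUE)]
  versions <- unique(sub(".*eml-", "", eml_uris))
  if (length(versions) != 1) {
    stop("Inconsistent EML versions in namespace declarations: ",
         paste(versions, collapse = ", "))
  }
  versions
}
```

The schemaLocation attribute repeats the version as well; checking it against the namespace result would catch the remaining inconsistency the issue describes.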

bug fix: test_file_names()

Currently the test_file_names() function uses grepl to find all "objectName" elements; this may be too aggressive:

There could be objectName elements that are NOT data files (i.e. otherEntity). This would cause the test to fail (when perhaps it shouldn't?) as they would not be in the data package directory as .csvs.

Alternatively, there could be data files that have been mistakenly added to the metadata as otherEntity rather than as dataTable. The current test_file_names() does not throw an error and the check passes, even though it should definitely not pass and should be a major error.
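A sketch of the fix, assuming xml2: restrict the search to objectName under dataTable, and separately flag csv-like objectNames filed under otherEntity (function name illustrative):

```r
# Sketch: only dataTable objectNames count as data files; a .csv listed under
# otherEntity is a major error.
library(xml2)

get_data_table_names <- function(metadata_path) {
  doc <- read_xml(metadata_path)
  tables <- xml_text(
    xml_find_all(doc, ".//dataset/dataTable/physical/objectName"))
  others <- xml_text(xml_find_all(doc, ".//dataset/otherEntity//objectName"))
  misfiled <- others[grepl("\\.csv$", others, ignore.case = TRUE)]
  if (length(misfiled) > 0) {
    stop("csv file(s) listed as otherEntity instead of dataTable: ",
         paste(misfiled, collapse = ", "))
  }
  tables
}
```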

Test round 2 functions

  • test_taxonomic_cov
  • test_geographic_cov
  • test_doi
  • test_publisher
  • test_valid_fieldnames
  • test_valid_filenames
  • test_delimiter

Documentation: list DPchecker tests

The GitHub site ought to list all the checks that DPchecker runs (and the order it runs them).

Perhaps it should say a little about what each check tests and what a pass/fail/warn looks like.

check metadata for license

eml/dataset/licensed/licenseName

-Error if empty
-Error if not correct (can only be 1 of 3 things)
-Error if does not match corresponding CUI code
-Warn if it is restricted
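A sketch of the error/warn logic, assuming eml_get_simple() as used elsewhere in these issues; the three allowed license names here are placeholders, not the authoritative list, and the CUI cross-check is left as a stub:

```r
# Sketch: licenseName must be present and one of three accepted values;
# warn if the package is restricted. ALLOWED_LICENSES is an assumption.
ALLOWED_LICENSES <- c("Public Domain",
                      "CC0 1.0 Universal",
                      "Unlicensed (not for public dissemination)")

test_license <- function(metadata) {
  license <- arcticdatautils::eml_get_simple(metadata, "licenseName")
  if (length(license) == 0 || !nzchar(license)) {
    stop("licenseName is empty or missing")
  }
  if (!license %in% ALLOWED_LICENSES) {
    stop("licenseName is not one of the three accepted values")
  }
  # TODO: error if license does not match the corresponding CUI code
  if (license == "Unlicensed (not for public dissemination)") {
    warning("Data package is restricted")
  }
  invisible(TRUE)
}
```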

test_required_metadata_elements

Write a test that looks for all required metadata elements and tells which ones are missing.

list of current required elements:
title (eml/dataset/title)

pubdate (eml/dataset/pubDate)
for or by nps (eml/dataset/additionalMetadata/agencyOriginated/byOrForNPS)
publisher (eml/dataset/publisher/organizationName)
publisher location (eml/dataset/publisher/address/city; eml/dataset/publisher/address/administrativeArea)
abstract (eml/dataset/abstract)
??? (eml/xmlns:eml)
file name (eml/dataset/dataTable/physical/objectName)
file description (eml/dataset/dataTable/entityDescription)
CUI dissemination code
license (eml/dataset/licensed/licenseName) - and make sure agrees with CUI code
intellectual rights (don't recheck text, just that it is present)
eml/dataset/dataTable/attributeList/attribute/attributeName
eml/dataset/dataTable/attributeList/attribute/attributeDefinition
eml/dataset/dataTable/attributeList/attribute/storageType
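A sketch of the reporting behavior, assuming xml2; for simplicity it searches by element name rather than full path, and the list below is a subset of the required elements above:

```r
# Sketch: report every required element that is absent, not just the first.
library(xml2)

test_required_metadata_elements <- function(metadata_path) {
  required <- c("title", "pubDate", "publisher", "abstract",
                "objectName", "entityDescription", "licenseName",
                "attributeName", "attributeDefinition", "storageType")
  doc <- read_xml(metadata_path)
  present <- vapply(required, function(el) {
    length(xml_find_all(doc, paste0(".//", el))) > 0
  }, logical(1))
  if (!all(present)) {
    stop("Missing required metadata element(s): ",
         paste(required[!present], collapse = ", "))
  }
  invisible(TRUE)
}
```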

Can we get rid of test_dup_data_files()?

It looks for csv files in the data folder with identical names, which isn't actually possible (as noted in the fxn description). I think keeping it might just confuse people? Also I'm too lazy to write a unit test for it.

Replace getwd() with here::here() in default args

It adds another package dependency, but this is a good package for everyone to have and use. NOTE: using here::here() should be fine to use as a default function arg, but its documentation says it's intended for interactive use so we shouldn't use it inside of package functions.

Review fxn return values

Check for consistency and modify to conform to style guide if necessary/reasonable
For ideas, look at how testthat does it

argument "element" is missing, with no default

New EML file and a new little hiccup. Most of the way through the checks a parsing issue appears and a missing element is mentioned. It then completes the summary, but shows a warning and error that were not printed previously.

(screenshot attached to the original issue)
