DPchecker (Data Package checker) is a package providing a series of functions that NPS data package authors and reviewers can use to check data and metadata for internal consistency and for compliance with the data package standards.
A tool for running and reporting congruence test results could be useful, or maybe it's better to just make a vignette instead (especially if we're able to use pkgdown sites in GitHub Enterprise).
Allowed: alphanumeric and underscores
In addition, column names must start with a letter (not a number or underscore), because leading digits and underscores cause problems in R and other languages.
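The rule above can be sketched as a simple regex check (a hypothetical helper, not an existing DPchecker function):

```r
# Hypothetical helper: TRUE for names that contain only letters, digits,
# and underscores AND start with a letter; FALSE otherwise.
valid_colname <- function(names) {
  grepl("^[A-Za-z][A-Za-z0-9_]*$", names)
}

valid_colname(c("site_id", "Temp2", "2temp", "_temp", "air temp"))
# TRUE TRUE FALSE FALSE FALSE
```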
Can probably use @inheritParams to avoid writing redundant param descriptions.
Recommend using here::here() instead of getwd(), and avoid using setwd() (see example in pull request). It adds another package dependency, but this is a good package for everyone to have and use. NOTE: using here::here() should be fine to use as a default function arg, but its documentation says it's intended for interactive use so we shouldn't use it inside of package functions.
Functions should probably return TRUE/FALSE to indicate pass/fail, or do it testthat style: return an error description for failures and nothing for passes.
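A minimal sketch of the testthat-style option (the function name and message here are hypothetical): stop with a descriptive error on failure and return invisibly on success, so a silent run means everything passed.

```r
# Hypothetical check: error with a description on fail, invisible TRUE on pass.
check_one_header_row <- function(n_header_rows) {
  if (n_header_rows != 1) {
    stop("Metadata indicates ", n_header_rows,
         " header rows; expected exactly 1.")
  }
  invisible(TRUE)
}
```

This style also composes nicely with testthat itself, e.g. `expect_error(check_one_header_row(2))` in the package's unit tests.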
In load.data, I don't think assign() is necessary
Consider adding a single function that runs all congruence tests, and/or a simple Rmd congruence report template
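One way such a wrapper could work (all names here are hypothetical, not the package's actual API): run each check under tryCatch and collect a pass/fail message per test, so one bad check doesn't halt the rest.

```r
# Hypothetical runner: takes a named list of zero-argument check functions,
# runs each one, and reports PASS or FAIL (with the error message) per check.
run_all_checks <- function(checks) {
  vapply(names(checks), function(nm) {
    tryCatch({
      checks[[nm]]()
      paste0("PASS: ", nm)
    }, error = function(e) paste0("FAIL: ", nm, " - ", conditionMessage(e)))
  }, character(1))
}
```

The resulting character vector could feed directly into an Rmd congruence report template.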
Currently it throws an error if there is no delimiter info at all in the metadata file, but ignores the case where there is delimiter info for some data files but not others.
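A sketch of handling the partial case as well (hypothetical helper; it assumes the delimiters arrive as a named character vector with NA for files lacking delimiter info):

```r
# Hypothetical helper: error when delimiter info is absent entirely OR
# present for only some of the data files.
check_delimiters <- function(delimiters) {
  if (all(is.na(delimiters))) {
    stop("No field delimiter information found in metadata.")
  }
  if (any(is.na(delimiters))) {
    stop("Missing delimiter for: ",
         paste(names(delimiters)[is.na(delimiters)], collapse = ", "))
  }
  invisible(TRUE)
}
```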
Does/should this function check whether these codes are actually used in the data? I see no harm in listing a missing-value code of, say, "NA" that does not occur in the data if no data are missing. There isn't really a way to determine whether missing data codes not listed in the metadata are used in the data (with the exception of non-numeric missing data codes in numeric data columns, which we already test for).
Additional caveat/hurdle: how to deal with blank cells as missing data? I think R will just read these in with is.na() being TRUE, whereas if someone puts NA in a column R may interpret it as the string "NA", with is.na("NA") being FALSE. If I recall correctly, there's also something funny about whether NAs appear in the first several rows and how that influences the column type R guesses.
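A quick base-R illustration of the behavior described above (using read.csv defaults; readr/vroom have their own `na` arguments):

```r
# Three rows: one blank temp cell, one literal "NA", one real value.
csv_text <- "site,temp\nA,\nB,NA\nC,5"
dat <- read.csv(text = csv_text)

# With default na.strings = "NA", both the blank cell and the literal "NA"
# come in as missing values in a numeric column:
is.na(dat$temp)   # TRUE TRUE FALSE

# But the character string "NA" itself is not missing, which is where
# comparisons against missing-value codes can go wrong:
is.na("NA")       # FALSE
```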
Consider updating test_date_range to only consider good data and ignore any dates flagged as rejected. Realistically, flagging conventions may vary across networks, so maybe provide arguments that let the user specify the name of the flag column and the codes that indicate rejected data.
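A sketch of what that interface could look like (the function and argument names are hypothetical): filter out rejected rows before computing the date range.

```r
# Hypothetical pre-filter for test_date_range: drop rows whose flag column
# contains one of the user-specified "rejected" codes.
drop_rejected <- function(data, flag_col = "flag", rejected_codes = c("R")) {
  if (!flag_col %in% names(data)) {
    return(data)  # no flag column present; nothing to filter
  }
  data[!(data[[flag_col]] %in% rejected_codes), , drop = FALSE]
}
```

test_date_range could then operate on `drop_rejected(data, ...)` instead of the raw data.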
── Checking metadata compliance ──
✔ Your metadata is schema valid.
✔ Each data file name is used exactly once in the metadata file.
✔ Your EML version is supported.
✔ Metadata indicates that each data file contains a field delimiter that is a single character
✔ Metadata indicates that each data file contains exactly one header row.
✔ Metadata indicates data files do not have footers.
✔ Metadata contains taxonomic coverage element.
✔ Metadata contains geographic coverage element
✔ Metadata contains a digital object identifier.
✔ Metadata contains publisher element.
Error in lapply(text, glue_cmd, .envir = .envir) : object 'e' not found
Probably easiest to put together a small dummy dataset but I'll need a little help making sure I'm including the right things, esp. when creating the metadata.
I think NPSutils is the right place for this? @RobLBaker if so, let me know whenever you push your latest changes to it so I can copy the fxn in without creating merge conflicts. No urgency on this.
test_numeric_fields(directory=here::here("SET"), metadata)
✔ Columns indicated as numeric in metadata contain only numeric values and valid missing
value codes.
Warning message:
One or more parsing issues, call problems() on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)
Looking at a couple of metadata files, the version number appears in a few places at the start of the XML document (example below). Do we want to check all of those for a consistent/correct EML version, or do we just want to look at one?
Currently the test_file_names() function uses grepl to find all "objectName" elements; this may be too aggressive:
There could be objectName elements that are NOT data files (e.g., under otherEntity). This would cause the test to fail (when perhaps it shouldn't?), as those files would not be present in the data package directory as .csvs.
Alternatively, there could be data files that have been mistakenly added to the metadata as otherEntity rather than as dataTable. The current test_file_names() does not throw an error and the check passes, even though it should definitely not pass; this should be flagged as a major error.
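One less aggressive approach, sketched below: collect objectName only from dataTable entities instead of grepl-ing over every objectName in the file. This assumes the emld-style nested list returned by EML::read_eml; the single-table wrapping logic is an assumption about that structure.

```r
# Hypothetical helper: pull file names only from dataTable entities,
# ignoring objectName elements under otherEntity etc.
get_datatable_filenames <- function(metadata) {
  tables <- metadata$dataset$dataTable
  # A lone dataTable parses as a plain list rather than a list of tables
  if (!is.null(tables$physical)) {
    tables <- list(tables)
  }
  unlist(lapply(tables, function(tbl) tbl$physical$objectName))
}
```

The complementary check (data files misfiled under otherEntity) could then compare this list against the .csvs actually present in the package directory.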
Write a test that checks for all required metadata elements and reports which ones are missing.
list of current required elements:
title (eml/dataset/title)
pubdate (eml/dataset/pubDate)
for or by nps (eml/dataset/additionalMetadata/agencyOriginated/byOrForNPS)
publisher (eml/dataset/publisher/organizationName)
publisher location (eml/dataset/publisher/address/city; eml/dataset/publisher/address/administrativeArea)
abstract (eml/dataset/abstract)
??? (eml/xmlns:eml)
file name (eml/dataset/dataTable/physical/objectName)
file description (eml/dataset/dataTable/entityDescription)
CUI dissemination code
license (eml/dataset/licensed/licenseName) - and make sure agrees with CUI code
intellectual rights (don't recheck text, just that it is present)
eml/dataset/dataTable/attributeList/attribute/attributeName
eml/dataset/dataTable/attributeList/attribute/attributeDefinition
eml/dataset/dataTable/attributeList/attribute/storageType
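The list above could drive a generic presence check like this sketch (hypothetical helper; it assumes metadata parsed into a nested list, with paths given relative to the eml root):

```r
# Hypothetical helper: walk each slash-separated path through the metadata
# list and return the paths that could not be found.
missing_required_elements <- function(metadata, paths) {
  present <- vapply(paths, function(path) {
    node <- metadata
    for (key in strsplit(path, "/", fixed = TRUE)[[1]]) {
      node <- node[[key]]
      if (is.null(node)) return(FALSE)
    }
    TRUE
  }, logical(1))
  names(present)[!present]
}
```

An empty return value would mean all required elements are present; otherwise the returned paths name exactly what is missing.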
It looks for csv files in the data folder with identical names, which isn't actually possible (as noted in the fxn description). I think keeping it might just confuse people? Also I'm too lazy to write a unit test for it.
New EML file and a new little hiccup. Most of the way through the checks, a parsing issue appears and a missing element is mentioned. The run then completes the summary, but shows a warning and an error that were not printed previously.