nationalparkservice / DPchecker

DPchecker (Data Package checker) is an R package with a series of functions for NPS data package authors and reviewers to check for internal consistency between data and metadata and for conformance with the data package standards.

Home Page: https://nationalparkservice.github.io/DPchecker/

License: Other

R 100.00%
datastore ecological-meta-data eml national-park-service nps r schema

dpchecker's People

Contributors

roblbaker, wright13


Forkers

roblbaker

dpchecker's Issues

check metadata for storage type

eml/dataset/dataTable/attributeList/attribute/storageType

-Error if empty/missing
-Do anything else with this? More tests?
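A minimal sketch of such a check, assuming xml2 is available and the element path above; the function name is illustrative, not DPchecker's actual implementation:

```r
# Sketch: error if any attribute has an empty or missing storageType (assumes xml2)
library(xml2)

check_storage_type <- function(metadata_path) {
  doc <- read_xml(metadata_path)
  attrs <- xml_find_all(doc, ".//dataTable/attributeList/attribute")
  bad <- vapply(seq_along(attrs), function(i) {
    st <- xml_find_first(attrs[[i]], "./storageType")
    inherits(st, "xml_missing") || !nzchar(trimws(xml_text(st)))
  }, logical(1))
  if (any(bad)) {
    stop(sum(bad), " attribute(s) have an empty or missing storageType")
  }
  invisible(TRUE)
}
```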

Consider adding Rmd template

For running and reporting congruence test results. Could be useful, or maybe it's better to just make a vignette instead (esp. if we're able to use pkgdown sites in GitHub Enterprise).

Sarah comments

  • Can probably use @inheritParams to avoid writing redundant param descriptions
  • Recommend using here::here() instead of getwd(), and avoid using setwd() (see example in pull request). It adds another package dependency, but this is a good package for everyone to have and use. NOTE: here::here() should be fine as a default function argument, but its documentation says it's intended for interactive use, so we shouldn't call it inside package functions.
  • Functions should probably return TRUE/FALSE to indicate pass/fail, or do it testthat style and return the error description for failures and nothing for passes.
  • In load_data, I don't think assign() is necessary
  • Consider adding a single function that runs all congruence tests, and/or a simple Rmd congruence report template
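To illustrate the here::here() and assign() points together, a sketch of what a loader could look like (names and behavior are illustrative, not DPchecker's actual implementation):

```r
# Sketch: here::here() as a default argument (evaluated lazily at call time,
# so it runs in the caller's interactive session, not inside the package).
load_data <- function(directory = here::here()) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  # No assign() needed: just return a named list of data frames.
  setNames(lapply(files, readr::read_csv), basename(files))
}
```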

How should we handle read_csv errors in load_data?

read_csv can throw warnings/errors when csv files aren't in the right encoding and/or when it guesses column types wrong.

Encoding

  • I believe we're requiring/strongly encouraging UTF-8, is that correct?
  • Is this something we can check for?

Column types

  • readr guesses each column's type from the first 1,000 rows by default (the guess_max argument).
  • The brute-force fix is to make it look at all of the rows, but I'm not sure how much this slows things down.
  • We can also set column types based on the metadata, but that breaks if the files and attributes don't match the data
    • If we go this route, the order of the checks matters
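Both routes (and the encoding question) map onto readr arguments; a sketch, where csv_path is a placeholder and attr_types is a hypothetical named vector derived from the EML attributeList:

```r
# Encoding: readr can guess; flag files whose best guess isn't UTF-8/ASCII.
enc <- readr::guess_encoding(csv_path)

# 1. Brute force: guess column types from every row (slower on large files).
dat <- readr::read_csv(csv_path, guess_max = Inf)

# 2. Set column types from the metadata instead of guessing. `attr_types` is
#    illustrative, e.g. c(siteID = "c", count = "d", visitDate = "D"); note
#    this breaks if the attribute order doesn't match the file's columns.
dat <- readr::read_csv(csv_path, col_types = paste(attr_types, collapse = ""))
```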

test missing data codes/definitions

Write a function that:

  1. checks for missing data codes
  2. checks for missing data code definitions

Does/should this function check for whether these codes are used in the data? I see no harm in having a missing code of say "NA" that does not occur in the data if no data are missing. There isn't really a way to determine whether missing data codes not listed in the metadata are used in the data (with the exception of non-numeric missing data codes in numeric data columns, which we already test for).

Additional caveat/hurdle: how to deal with blank cells as missing data? I think R will just read these as NA, with is.na() returning TRUE, whereas if someone types NA into a column R may interpret it as the string "NA", with is.na("NA") being FALSE. If I recall correctly, there's also something funny about whether NAs appear in the first several rows and how that influences how R interprets the column.
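A sketch of the two checks in the numbered list above, assuming xml2 and the standard EML missingValueCode structure (function name illustrative):

```r
# Sketch: each missingValueCode must have both a <code> and a non-empty
# <codeExplanation> (definition).
library(xml2)

test_missing_data_codes <- function(metadata_path) {
  doc <- read_xml(metadata_path)
  mvcs <- xml_find_all(doc, ".//attribute/missingValueCode")
  bad <- vapply(seq_along(mvcs), function(i) {
    code <- xml_text(xml_find_first(mvcs[[i]], "./code"))
    def  <- xml_text(xml_find_first(mvcs[[i]], "./codeExplanation"))
    is.na(code) || !nzchar(trimws(code)) || is.na(def) || !nzchar(trimws(def))
  }, logical(1))
  if (any(bad)) {
    stop(sum(bad), " missing data code(s) lack a code or a definition")
  }
  invisible(TRUE)
}
```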

check metadata for publisher location

eml/dataset/publisher/address/city; eml/dataset/publisher/address/administrativeArea

"administrativeArea" should hold state. Do we also want the city?

-Error if empty
-Warn if not Fort Collins, CO.
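A sketch of the error/warn behavior above, assuming xml2 and the EML paths from this issue (function name illustrative):

```r
# Sketch: error if publisher city/state are missing; warn if not Fort Collins, CO.
library(xml2)

test_publisher_location <- function(metadata_path) {
  doc <- read_xml(metadata_path)
  city  <- xml_text(xml_find_first(doc, ".//dataset/publisher/address/city"))
  state <- xml_text(xml_find_first(doc,
             ".//dataset/publisher/address/administrativeArea"))
  if (is.na(city) || is.na(state)) {
    stop("Publisher city and/or state are missing from the metadata")
  }
  if (city != "Fort Collins" || state != "CO") {
    warning("Publisher location is ", city, ", ", state,
            ", not Fort Collins, CO")
  }
  invisible(TRUE)
}
```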

Update test_date_range to handle flagged data

Consider updating test_date_range to only consider good data and ignore any dates flagged as rejected. Realistically, flags may vary across networks, so maybe provide arguments that let the user specify the name of the flag column and the codes that indicate rejected data
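A sketch of the proposed interface, where the argument names (flag_col, rejected_codes) and default codes are illustrative, not an existing API:

```r
# Sketch: let the caller name the flag column and the codes meaning "rejected";
# by default, behave as before (no filtering).
test_date_range <- function(data, metadata,
                            flag_col = NULL,
                            rejected_codes = c("R", "Rejected")) {
  if (!is.null(flag_col)) {
    data <- data[!data[[flag_col]] %in% rejected_codes, ]
  }
  # ... existing date-range comparison against the metadata runs on `data` ...
}
```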

Test for pubdate

pubdate (eml/dataset/pubDate)

-Check: ISO formatting
-Check: is the year reasonable (say 2023 or later)
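A sketch of both checks, assuming arcticdatautils::eml_get_simple() as used elsewhere in these issues, and assuming full YYYY-MM-DD dates (EML also allows a bare year, which this regex would reject):

```r
# Sketch: error if pubDate is missing or not ISO YYYY-MM-DD; warn on an
# implausibly early year.
test_pub_date <- function(metadata) {
  pubdate <- arcticdatautils::eml_get_simple(metadata, "pubDate")
  if (length(pubdate) == 0 || !grepl("^\\d{4}-\\d{2}-\\d{2}$", pubdate)) {
    stop("pubDate is missing or not in ISO YYYY-MM-DD format")
  }
  year <- as.integer(substr(pubdate, 1, 4))
  if (year < 2023) {
    warning("pubDate year (", year, ") looks unreasonably early")
  }
  invisible(TRUE)
}
```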

Error in lapply

Ran run_congruence_checks() on PACN Streams Data Package and I get an lapply error.

── Checking metadata compliance ──
✔ Your metadata is schema valid.
✔ Each data file name is used exactly once in the metadata file.
✔ Your EML version is supported.
✔ Metadata indicates that each data file contains a field delimiter that is a single character
✔ Metadata indicates that each data file contains exactly one header row.
✔ Metadata indicates data files do not have footers.
✔ Metadata contains taxonomic coverage element.
✔ Metadata contains geographic coverage element
✔ Metadata contains a digital object identifier.
✔ Metadata contains publisher element.
Error in lapply(text, glue_cmd, .envir = .envir) : object 'e' not found
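The "object 'e' not found" from lapply(text, glue_cmd, ...) is characteristic of glue (which cli uses for message templates) interpolating a variable that isn't in scope on that code path. A guess at the failure mode, reproduced in miniature (the variable name e matches the error, but the actual template in the package may differ):

```r
# Likely failure mode (illustrative): a glue/cli message template references
# a variable that was never assigned on the path that built the message.
library(glue)
glue("Could not check publisher location: {e}")
# errors with: object 'e' not found
```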

Write unit tests

Probably easiest to put together a small dummy dataset but I'll need a little help making sure I'm including the right things, esp. when creating the metadata.

test_date_range throws error

line 586: firstDate <- arcticdatautils::eml_get_simple(metadata)
Needs a second argument, the element name: arcticdatautils::eml_get_simple(metadata, "beginDate")

test_numeric_fields produces warnings

test_numeric_fields(directory=here::here("SET"), metadata)
✔ Columns indicated as numeric in metadata contain only numeric values and valid missing
value codes.
Warning message:
One or more parsing issues, call problems() on your data frame for details, e.g.:
dat <- vroom(...)
problems(dat)

How to determine EML version?

Looking at a couple of metadata files, it looks like the version number appears in a few places at the start of the xml doc (example below). Do we want to check all of those for consistent/correct EML version, or do we just want to look at one?

<eml:eml xmlns:eml="https://eml.ecoinformatics.org/eml-2.2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.2" packageId="BUIS_herps_metadata" xsi:schemaLocation="https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd https://eml.ecoinformatics.org/eml-2.2.0/eml.xsd" system="unknown">
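One option is to read the version from every namespace URI on the root element and error if they disagree; a sketch assuming xml2 (function name illustrative):

```r
# Sketch: extract the EML version from the root namespace declarations,
# e.g. "https://eml.ecoinformatics.org/eml-2.2.0" -> "2.2.0".
library(xml2)

get_eml_version <- function(metadata_path) {
  doc <- read_xml(metadata_path)
  ns <- as.character(xml_ns(doc))
  eml_uris <- ns[grepl("eml.ecoinformatics.org", ns, fixed = TRUE)]
  versions <- unique(sub(".*eml-", "", eml_uris))
  if (length(versions) != 1) {
    stop("Inconsistent EML versions in namespace declarations: ",
         paste(versions, collapse = ", "))
  }
  versions
}
```

The schemaLocation attribute repeats the version as well; checking it against the namespace result would catch the remaining inconsistency the issue describes.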

bug fix: test_file_names()

Currently the test_file_names() function uses grepl to find all "objectName" elements; this may be too aggressive:

There could be objectName elements that are NOT data files (i.e. otherEntity). This would cause the test to fail (when perhaps it shouldn't?) as they would not be in the data package directory as .csvs.

Alternatively, there could be data files that have been mistakenly added to the metadata as otherEntity rather than as dataTable. The current test_file_names() does not throw an error and the check passes, even though it should definitely not pass and should be a major error.
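A sketch of the fix, assuming xml2: restrict the search to objectName under dataTable, and separately flag csv-like objectNames filed under otherEntity (function name illustrative):

```r
# Sketch: only dataTable objectNames count as data files; a .csv listed under
# otherEntity is a major error.
library(xml2)

get_data_table_names <- function(metadata_path) {
  doc <- read_xml(metadata_path)
  tables <- xml_text(
    xml_find_all(doc, ".//dataset/dataTable/physical/objectName"))
  others <- xml_text(xml_find_all(doc, ".//dataset/otherEntity//objectName"))
  misfiled <- others[grepl("\\.csv$", others, ignore.case = TRUE)]
  if (length(misfiled) > 0) {
    stop("csv file(s) listed as otherEntity instead of dataTable: ",
         paste(misfiled, collapse = ", "))
  }
  tables
}
```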

Test round 2 functions

  • test_taxonomic_cov
  • test_geographic_cov
  • test_doi
  • test_publisher
  • test_valid_fieldnames
  • test_valid_filenames
  • test_delimiter

Documentation: list DPchecker tests

The GitHub site ought to list all the checks that DPchecker runs (and the order it runs them).

Perhaps it should say a little about what each check tests and what a pass/fail/warn looks like.

check metadata for license

eml/dataset/licensed/licenseName

-Error if empty
-Error if not correct (can only be 1 of 3 things)
-Error if does not match corresponding CUI code
-Warn if it is restricted
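A sketch of the error/warn logic, assuming eml_get_simple() as used elsewhere in these issues; the three allowed license names here are placeholders, not the authoritative list, and the CUI cross-check is left as a stub:

```r
# Sketch: licenseName must be present and one of three accepted values;
# warn if the package is restricted. ALLOWED_LICENSES is an assumption.
ALLOWED_LICENSES <- c("Public Domain",
                      "CC0 1.0 Universal",
                      "Unlicensed (not for public dissemination)")

test_license <- function(metadata) {
  license <- arcticdatautils::eml_get_simple(metadata, "licenseName")
  if (length(license) == 0 || !nzchar(license)) {
    stop("licenseName is empty or missing")
  }
  if (!license %in% ALLOWED_LICENSES) {
    stop("licenseName is not one of the three accepted values")
  }
  # TODO: error if license does not match the corresponding CUI code
  if (license == "Unlicensed (not for public dissemination)") {
    warning("Data package is restricted")
  }
  invisible(TRUE)
}
```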

test_required_metadata_elements

Write a test that looks for all required metadata elements and tells which ones are missing.

list of current required elements:
title (eml/dataset/title)

pubdate (eml/dataset/pubDate)
for or by nps (eml/dataset/additionalMetadata/agencyOriginated/byOrForNPS)
publisher (eml/dataset/publisher/organizationName)
publisher location (eml/dataset/publisher/address/city; eml/dataset/publisher/address/administrativeArea)
abstract (eml/dataset/abstract)
??? (eml/xmlns:eml)
file name (eml/dataset/dataTable/physical/objectName)
file description (eml/dataset/dataTable/entityDescription)
CUI dissemination code
license (eml/dataset/licensed/licenseName) - and make sure agrees with CUI code
intellectual rights (don't recheck text, just that it is present)
eml/dataset/dataTable/attributeList/attribute/attributeName
eml/dataset/dataTable/attributeList/attribute/attributeDefinition
eml/dataset/dataTable/attributeList/attribute/storageType
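A sketch of the reporting behavior, assuming xml2; for simplicity it searches by element name rather than full path, and the list below is a subset of the required elements above:

```r
# Sketch: report every required element that is absent, not just the first.
library(xml2)

test_required_metadata_elements <- function(metadata_path) {
  required <- c("title", "pubDate", "publisher", "abstract",
                "objectName", "entityDescription", "licenseName",
                "attributeName", "attributeDefinition", "storageType")
  doc <- read_xml(metadata_path)
  present <- vapply(required, function(el) {
    length(xml_find_all(doc, paste0(".//", el))) > 0
  }, logical(1))
  if (!all(present)) {
    stop("Missing required metadata element(s): ",
         paste(required[!present], collapse = ", "))
  }
  invisible(TRUE)
}
```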

Can we get rid of test_dup_data_files()?

It looks for csv files in the data folder with identical names, which isn't actually possible (as noted in the fxn description). I think keeping it might just confuse people? Also I'm too lazy to write a unit test for it.

Replace getwd() with here::here() in default args

It adds another package dependency, but this is a good package for everyone to have and use. NOTE: using here::here() should be fine to use as a default function arg, but its documentation says it's intended for interactive use so we shouldn't use it inside of package functions.

Review fxn return values

Check for consistency and modify to conform to style guide if necessary/reasonable
For ideas, look at how testthat does it

argument "element" is missing, with no default

New EML file and a new little hiccup. Most of the way through the checks a parsing issue appears and a missing element is mentioned. It then completes the summary, but shows a warning and error that were not printed previously.

(screenshot attached to the original issue)
