nwbinspector's Introduction

PyPI version ReadTheDocs Tests codecov License

Inspect NWB files for compliance with NWB Best Practices. This inspector is meant as a companion to the PyNWB validator, which checks for strict schema compliance. In contrast, this tool attempts to apply some common sense to find components of the file that are technically compliant, but possibly incorrect, suboptimal in their representation, or deviate from best practices. This tool is meant simply as a data review aid. It does not catch all best practice violations, and any warnings it does produce should be checked by a knowledgeable reviewer.

Installation

pip install nwbinspector

Usage

# supply a path to an NWB file
nwbinspector path/to/my/data.nwb

# supply a path to a directory containing NWB files
nwbinspector path/to/my/data/folder/

nwbinspector's People

Contributors

alessandratrapani, bendichter, codycbakerphd, dependabot[bot], dysprague, h-mayorquin, jwodder, oruebel, pre-commit-ci[bot], rly, tomdonoghue, weiglszonja, yarikoptic


nwbinspector's Issues

[Feature]: parallelize over NWB files

What would you like to see added to the NWBInspector?

The -j / --njobs flag should control the number of simultaneous jobs that run the inspector over NWB files in parallel.
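A rough sketch (not the actual implementation) of how the flag could dispatch per-file inspection through `concurrent.futures`; `inspect_file` is a hypothetical stand-in for running all checks on one file:

```python
from concurrent.futures import ProcessPoolExecutor


def inspect_file(nwbfile_path):
    """Hypothetical stand-in for running all checks on a single NWB file."""
    return [f"message for {nwbfile_path}"]


def inspect_all(nwbfile_paths, n_jobs=1):
    """Run inspect_file over many files, up to n_jobs at a time."""
    if n_jobs == 1:
        return [msg for path in nwbfile_paths for msg in inspect_file(path)]
    with ProcessPoolExecutor(max_workers=n_jobs) as executor:
        results = executor.map(inspect_file, nwbfile_paths)
    return [msg for messages in results for msg in messages]
```

Since each file is inspected independently, a process pool avoids GIL contention for CPU-bound checks; the serial path is kept for `-j 1` and for debugging.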

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

[Add Check]: check session start time

Should probably check that the session start time has been set properly within a file; not just that it is a datetime object, but that it falls in a valid date range for performing neurophysiology studies. E.g., if session_start_time <= datetime(1970, 1, 1), print "Session start time may not be set to the true date of the recording."
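A minimal sketch of the check described above (function name hypothetical):

```python
from datetime import datetime


def check_session_start_time(session_start_time):
    """Flag session start times at or before the Unix epoch, which usually
    indicate an unset or default value rather than a real recording date."""
    if session_start_time <= datetime(1970, 1, 1):
        return "Session start time may not be set to the true date of the recording."
    return None
```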

[New Check]: Check Subject age or dob

What would you like to see added to the NWBInspector?

I'm getting this:

1.1.2   Subject 'subject' located in ''
        check_subject_age: Subject is missing age.

even if the Subject has date_of_birth defined.
I think check_subject_age should be changed to check_subject_age_or_dob
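A minimal sketch of the suggested behavior (signature hypothetical; the real check would receive the Subject object):

```python
def check_subject_age_or_dob(age=None, date_of_birth=None):
    """Pass if either age or date_of_birth is present on the Subject, so a
    Subject with only date_of_birth defined is no longer flagged."""
    if age is None and date_of_birth is None:
        return "Subject is missing age and date_of_birth."
    return None
```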

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

[Add Check]: entire col/row of table is not NaN, and entire axis of TimeSeries.data is not NaN

What would you like to see added to the NWBInspector?

We were already in the process of making something like this, inspired by previously commented-out code, and the issue also surfaced in a recent manual inspection.

Basically, an entire row or column of a DynamicTable, or an entire axis of a TimeSeries, should not all be NaN.

There are some exceptions we have to observe, such as x, y, z for the ElectrodeTable. Basically, any field that is required by the schema but unknown in practice would have to be skipped. But optionally added axes or custom fields should not be included at all if their values are unknown.

Pseudocode or other logic of the check

np.all(np.isnan(data), axis=0)  # True for each column that is entirely NaN
np.all(np.isnan(data), axis=1)  # True for each row that is entirely NaN

Do you have any interest in helping implement the check function?

Yes.

Code of Conduct

[Add Check]: unknown or default start times

What type of check would you like to see added to the NWBInspector?

At times in the past, if a starting_time for some TimeSeries was unknown we set it to np.nan to indicate this fact. I just wanted to raise the question here officially to see if this is indeed encouraged or if the default behavior makes sense.

It's not stated in the best practices, and the default behavior seems to assume starting_time is always 0.0, but especially in cases where multiple data streams are written to the NWBFile this may not be correct due to synchronization offsets.

Pseudocode or other logic of the check

No response

Do you have any interest in helping implement the check function?

Yes.

Code of Conduct

Provide "structured" machine readable output/return

A companion to #76 but largely orthogonal.

We have not standardized this much in dandi-cli either, and I have only briefly looked into nwbinspector's internals, but keeping in mind dandi/dandi-cli#924 (i.e., integration with dandi-cli), I think it would be great if internally all the hints/warnings were represented as some basic data structure (e.g., a dict), so that from the CLI I could request a dump of those hints in machine-readable form (e.g., JSON). The default renderer could format the output as it does now, and dandi-cli could then use nwbinspector as a library and simply get a list of those records for the path(s) it inquires about.

To some degree such an idea of "alternative output formats" can be seen in the dandi ls --format option, which provides a range of them. Also, datalad-inspired folks could use -f with any datalad command ;)
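A minimal sketch of the idea, assuming a hypothetical `InspectorMessage` record (the field names here are invented for illustration, not the actual schema):

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class InspectorMessage:
    """Hypothetical structured record for one hint/warning."""
    check_name: str
    object_name: str
    location: str
    message: str


def to_json(messages):
    """Dump a list of records in machine-readable form for downstream tools."""
    return json.dumps([asdict(m) for m in messages], indent=2)
```

dandi-cli could then consume the parsed list directly, while the default CLI renderer keeps formatting the same records as human-readable text.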

[Documentation]: Coordinate description for Allen CCF v3

What would you like changed or added to the documentation and why?

The current Best Practices (ported from the old ones): https://nwbinspector.readthedocs.io/en/dev/best_practices/ecephys.html#anatomical-coordinates

indicates that for Allen Institute CCF v3, (+x = posterior, +y = inferior, +z = right).

  1. Is there documentation I can link to specifically for these x/y/z conventions? I'm currently only linking to the general CCF page.
  2. 'right' seems ambiguous between a dorsal and a ventral view; is there a better way we can phrase that?

Do you have any interest in helping write or edit the documentation?

Yes.

Code of Conduct

refactoring

Let's think about how to make this more extendable. How about this: We have a Check class, and this defines:

  1. what neurodata type it is checking
  2. the test
  3. the severity

then nwbinspector simply loops through all neurodata objects, determines which checks apply to each object, and runs them.

NWBInspector would also have a method for adding or removing Checks
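A rough sketch of the proposal (class and method names hypothetical, not the actual API): a `Check` bundles the neurodata type, the test, and the severity, and `NWBInspector` registers and dispatches them:

```python
class Check:
    """One registered best-practice check, per the proposal above."""

    def __init__(self, neurodata_type, test, severity):
        self.neurodata_type = neurodata_type  # type the check applies to
        self.test = test                      # callable: object -> message or None
        self.severity = severity


class NWBInspector:
    def __init__(self):
        self.checks = []

    def add_check(self, check):
        self.checks.append(check)

    def remove_check(self, check):
        self.checks.remove(check)

    def inspect(self, objects):
        """Loop over objects once, running every check that applies to each."""
        messages = []
        for obj in objects:
            for check in self.checks:
                if isinstance(obj, check.neurodata_type):
                    result = check.test(obj)
                    if result is not None:
                        messages.append((check.severity, result))
        return messages
```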

yield inspector messages

I am working on a test for making sure that all time columns of a TimeIntervals object are ascending. This is a situation where a single object could produce multiple messages, one for each column. I think it would be useful to have the check functions yield their results, which would allow us to iterate over them.
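A minimal sketch of a generator-style check, using a plain dict of column name to values as a stand-in for the real TimeIntervals columns:

```python
def check_time_intervals_ascending(columns):
    """Yield one message per time column that is not ascending.

    `columns` maps column name -> list of values (stand-in for TimeIntervals).
    """
    for name, values in columns.items():
        if not name.endswith("_time"):
            continue  # only inspect time columns
        if any(b < a for a, b in zip(values, values[1:])):
            yield f"Column '{name}' is not ascending."
```

Because the check is a generator, the caller can simply iterate over each function's output and collect however many messages one object produces.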

[Minor performance]: Use Caching for CI environments

The most common usage of cache actions is to save the environment used to run tests so setup is not necessary.

The recommended approach is to key the cache on a hash of the current requirements.txt, so a fresh install only happens when the pinned versions change.

This will make the CI even faster and reduce matrix-strategy computational waste (so we could easily extend to Python 3.8 testing as well).

why check unique vals <= 4?

This line is cluttering up the inspect output:

elif len(uniq) <= 4:
    print("NOTE: '%s' %s data has only unique values %s" % (ts.name, type(ts).__name__, uniq))

I get why we are detecting a single unique value or two because it could be a boolean, but should we print anything for 3 or 4 unique values? What is the thinking behind this?

[Discussion]: Add Error Codes?

As per breakout session discussion, it may be worth it to re-examine the question of associating explicit error codes to check functions.

The main point is that this could (a) simplify skipping references, and (b) allow us to change an underlying function name while keeping its error code the same.

[Feature]: an auto-fixer for missing metadata

What would you like to see added to the NWBInspector?

An idea that came up during a recent meeting with some BIDS folk would be to have a nice user-friendly way of automatically fixing common issues that come up during a run of the NWBInspector.

Clearly, some types of Best Practice violations, such as wrong data orientation or missing data compression/chunking, would require a complete reconversion of the data, so they are not applicable here.

Likewise, the overlay/sidecar project Ryan is working on would be required for editing the form of any metadata violations (like ISO 8601 formatting or subject naming conventions) in fields that have already been written to the file.

There is, however, one case where this sort of feature could be quite useful: missing metadata

The feature would be as follows:

  1. [Optional] A fully interactive CLI would ask the user to explicitly type the values for each of the missing metadata fields that the NWB Best Practices suggests including.
  2. At the end of the interactive session (or written manually), a metadata .yml file (with structure very similar to how we do this in the NWBCT) would be saved and the user would be asked to review it one more time and confirm the values within it are accurate.
  3. With one more CLI call, the user can run the auto-fixer on the saved .yml file, which would result in all the missing metadata from the .yml being written to the NWBFile(s) in append mode, and deleting the source .yml file to keep things clean (only if no errors occurred during the process). Basically the idea is to 'fold' the .yml content into the NWBFile.
  4. [Future] When sidecars are available, this process can harness them to cover not just missing metadata, but also existing text fields that need to be updated.
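Step 3 above could be sketched roughly as follows. To keep the sketch dependency-free it reads JSON instead of the proposed .yml, and `apply_to_file` is a hypothetical callback standing in for writing missing metadata to the NWBFile(s) in append mode:

```python
import json
import os


def fold_metadata(metadata_path, apply_to_file):
    """Apply reviewed missing-metadata values, then delete the source file
    only if no errors occurred, 'folding' its content into the NWBFile."""
    with open(metadata_path) as f:
        metadata = json.load(f)
    try:
        apply_to_file(metadata)  # stand-in for append-mode writes to the NWBFile
    except Exception:
        return False             # keep the source file around for debugging
    os.remove(metadata_path)     # fold succeeded; keep things clean
    return True
```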

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

[Feature]: per dataset configuration file to skip/ignore false positives

What would you like to see added to the NWBInspector?

Originally requested in #37. I think the implementation in #86 went off on a different tangent from what my bids-validator-rotten mind had in mind: e.g., given the "stock" dandi configuration (provided by nwbinspector), I want to disable some tests/checks, e.g. to stop reacting to a benign check_data_orientation as in the example of dandi/handbook#40.

Then I would create some .nwbinspector.cfg or similar, which would instruct nwbinspector to ignore such errors whenever it is run in this folder.

Also, #86 has a TODO: expose to command line -- was that done?

Do you have any interest in helping implement the feature?

No.

Code of Conduct

[Minor performance]: organization and main loop iterations

Testing out this new button on GitHub: the one you get when you open the '...' options in a GitHub discussion and select 'raise in new issue'.

Anyway, this is super low priority at the moment, but leaving it here for future reference. Especially now that we have parallelization functioning well, per-job performance is less of an issue.

An analogous approach can also be made to improve the main iteration loops for the NWBFile objects vs. applicable neurodata_types in inspect_nwb.

From Ben's original posting:

How about this: I create an objects_by_attr dict so that the objects are only looped over once per recursion.

from collections import defaultdict

def hier_org(objects, levels, reverse=None):
    """Nest objects into a dict hierarchy keyed by the attributes in `levels`."""
    if reverse is None:
        reverse = [False] * len(levels)
    unique_attrs = sorted(set(getattr(obj, levels[0]) for obj in objects))
    if reverse[0]:
        unique_attrs = unique_attrs[::-1]
    objects_by_attr = defaultdict(list)
    for obj in objects:  # plain loop instead of a side-effect comprehension
        objects_by_attr[getattr(obj, levels[0])].append(obj)
    if len(levels) > 1:
        return {attr: hier_org(objects_by_attr[attr], levels[1:], reverse[1:]) for attr in unique_attrs}
    else:
        return {attr: objects_by_attr[attr][0] for attr in unique_attrs}

# test
from collections import namedtuple

Obj = namedtuple("Obj", "a b c")

objects = []
for a in range(3):
    for b in range(3):
        for c in range(3):
            objects.append(Obj(a=a, b=b, c=c))
            
hier_org(objects, ["a", "b", "c"], [True, False, True])

Originally posted by @bendichter in #89 (comment)

Check that dimension lengths are aligned

e.g., the number of elements in ElectricalSeries.electrodes should equal the length of the 2nd dimension of ElectricalSeries.data, and the number of elements in TimeSeries.timestamps should equal the length of the 1st dimension of TimeSeries.data.
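A dependency-free sketch of the comparison, using nested lists in place of the real datasets (function name hypothetical):

```python
def check_dimensions_aligned(data, timestamps=None, electrodes=None):
    """Compare dimension lengths, per the TimeSeries/ElectricalSeries examples.

    `data` is a (possibly nested) list standing in for the dataset.
    """
    messages = []
    n_frames = len(data)
    n_channels = len(data[0]) if n_frames and isinstance(data[0], (list, tuple)) else None
    if timestamps is not None and len(timestamps) != n_frames:
        messages.append("Length of timestamps should match the 1st dimension of data.")
    if electrodes is not None and n_channels is not None and len(electrodes) != n_channels:
        messages.append("Length of electrodes should match the 2nd dimension of data.")
    return messages
```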

[Add Check]: Slashes in object names

What would you like to see added to the NWBInspector?

Check that the name field of all objects does not include slashes ('/' or '\').

Not really sure what the neurodata_type should be for this, though. Maybe have it act on anything that is a subclass of a NWBContainer, and then the logic of the test scans all the items inside that container?

Or just have it be a recursive function that starts at the NWBFile container level and then recurses down any objects that are themselves containers?

It's already in the Best Practices: towards the bottom of https://nwbinspector.readthedocs.io/en/dev/best_practices/naming.html#neurodata-types (which will get cleaned up in the next week)
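The recursive variant could be sketched as below, assuming a hypothetical `get_children` accessor for enumerating a container's child objects (the real traversal would use PyNWB's container API):

```python
def check_name_slashes(container, get_children=lambda obj: getattr(obj, "children", [])):
    """Recursively collect names containing '/' or '\\', starting from the
    top-level container and descending into any child containers."""
    bad_names = []
    name = getattr(container, "name", "")
    if "/" in name or "\\" in name:
        bad_names.append(name)
    for child in get_children(container):
        bad_names.extend(check_name_slashes(child, get_children))
    return bad_names
```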

Pseudocode or other logic of the check

No response

Do you have any interest in helping implement the check function?

Yes.

Code of Conduct

Check Subject ID

In advance of DANDI intention flag feature, need to add a check that subject ID is specified. Without DANDI intention this is probably just a BEST_PRACTICE_SUGGESTION, but will get elevated to CRITICAL when run with intention=DANDI_UPLOAD or DANDI_PUBLISH.

Check Subject Species is (a) specified and (b) follows some ontology

In advance of DANDI intention flag feature, need to add a check that subject species is specified. Without DANDI intention this is probably just a BEST_PRACTICE_SUGGESTION.

Also need to add a check that, if subject species is specified, it uses binomial nomenclature (with proper capitalization). Without DANDI intention this is probably a BEST_PRACTICE_VIOLATION, since they did take the time to fill it out. There are some special cases that might be OK; for example, something we've done before is a sub-strain qualifier like 'Rattus norvegicus - Long Evans'. It also looks like DANDI supports URLs for taxonomies: https://github.com/dandi/dandi-cli/blob/5229292b777f644c5620bddf2abd4738168f0e81/dandi/metadata.py#L477-L486

But both of these will get elevated to CRITICAL when run with intention=DANDI_UPLOAD or DANDI_PUBLISH.
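A loose sketch of the species check (the regex is a deliberately permissive illustration, not a validated taxonomy rule; it accepts the 'Rattus norvegicus - Long Evans' style qualifier and URL taxonomies mentioned above):

```python
import re

# Latin binomial, optionally followed by a " - <strain>" qualifier.
BINOMIAL_PATTERN = re.compile(r"^[A-Z][a-z]+ [a-z]+( - .+)?$")


def check_subject_species(species):
    if species is None:
        return "Subject species is missing."
    if species.startswith("http"):
        return None  # taxonomy URL, which DANDI also accepts
    if not BINOMIAL_PATTERN.match(species):
        return "Species should use binomial nomenclature, e.g. 'Mus musculus'."
    return None
```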

[Documentation]: MatNWB UUID alternative?

What would you like changed or added to the documentation and why?

@bendichter or @lawrence-mbf In the Best Practices for the NWBFile identifier, I instruct PyNWB users towards the UUID project. What is the supported analog in MatNWB and do you have a link to a tutorial or other instructions that I can include?

Do you have any interest in helping write or edit the documentation?

Yes.

Code of Conduct

Check Subject Age is (a) specified and (b) ISO8601

In advance of the DANDI intention flag feature, need to add a check that subject age is specified. Without DANDI intention this is probably just a BEST_PRACTICE_SUGGESTION.

Also need to add a check that, if subject age is specified, it is specified as an ISO 8601 duration. Without DANDI intention this is probably a BEST_PRACTICE_VIOLATION, since they did take the time to fill it out.

But both of these will get elevated to CRITICAL when run with intention=DANDI_UPLOAD or DANDI_PUBLISH.
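A loose sketch of the format half of the check; the regex is a simplified illustration of ISO 8601 durations (e.g. "P90D", "P2Y3M", "PT30M"), not a complete validator:

```python
import re

# Simplified ISO 8601 duration: P<date parts>[T<time parts>].
ISO8601_DURATION = re.compile(
    r"^P(?!$)(\d+Y)?(\d+M)?(\d+W)?(\d+D)?(T(\d+H)?(\d+M)?(\d+(\.\d+)?S)?)?$"
)


def check_subject_age_format(age):
    if age is None:
        return "Subject age is missing."
    if not ISO8601_DURATION.match(age):
        return "Subject age should be an ISO 8601 duration, e.g. 'P90D'."
    return None
```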

Add badges to README

Just need to add the little badge displays for version, tests, coverage, etc.

[Feature]: helper tool for running directly on a DANDISet ID #

What would you like to see added to the NWBInspector?

We've been running the inspector on DANDISets using ros3 mode for a while now, but much of it is still manually written code for resolving the S3 paths. It's easy enough to automate, so we might as well port that code over so a single simple function can be called each time.

The only concern is whether it might produce circular dependencies once dandi-cli merges its integration with the inspector.

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

[Add Check]: validating video files under external_file attribute.

What would you like to see added to the NWBInspector?

Video files that are linked to nwbfiles under ImageSeries.external_file should be inspected for:

  • Check that they exist on disk at the specified path (high importance). This check would differ for local disk paths vs. URLs.
  • Check whether the video files use a common format: .avi, .mp4, .mov, .wmv, .flv. This can be a mild recommendation (low importance).
  • If external_file is a local path, check whether it is absolute or relative to the NWBFile location (low importance).

Pseudocode or other logic of the check

import os
from pathlib import Path

import requests

video_file_path = "documents/folder0/video0.mp4"
video_file_url = "https://dandiarchive.s3.amazonaws.com/blobs/a52/253/a5225398-ec1c-4325-8bbe-83dc73fa8c87"

def video_pathtype_check_absolute(video_path):
    """Classify the external_file path as 'absolute', 'url', or 'relative'."""
    if os.path.isabs(video_path):
        return "absolute"
    elif any(str(video_path).startswith(prefix) for prefix in ("ftp://", "http://", "https://")):
        return "url"
    else:
        return "relative"

def video_exist_check(video_path, nwbfile_path):
    """Check that the linked video exists on disk, or responds at its URL."""
    video_path_type = video_pathtype_check_absolute(video_path)
    if video_path_type == "relative":
        return (Path(nwbfile_path) / Path(video_path)).exists()
    elif video_path_type == "url":
        return requests.head(video_path).ok
    else:
        return Path(video_path).exists()

def video_format_check(video_path):
    """Check that the video uses one of the common container formats."""
    return Path(video_path).suffix in (".mp4", ".avi", ".wmv", ".mov", ".flv")

Do you have any interest in helping implement the check function?

Yes.

Code of Conduct

GH Action for codecov not working

  1. The upload has the warning:
     Warning: Unexpected input(s) 'yml', valid inputs are ['token', 'files', 'directory', 'flags', 'aws_curl_args', 'codecov_curl_args', 'commit_parent', 'env_vars', 'fail_ci_if_error', 'file', 'functionalities', 'gcov_args', 'gcov_executable', 'gcov_path_exclude', 'gcov_path_include', 'gcov_prefix', 'gcov_root_dir', 'move_coverage_to_trash', 'name', 'network_filter', 'override_branch', 'override_build', 'override_commit', 'override_pr', 'override_tag', 'path_to_write_report', 'root_dir', 'verbose', 'working-directory', 'xcode_derived_data', 'xcode_package']
  2. https://github.com/codecov/codecov-action v1 has been deprecated. We should use v2 now. I think this is why https://app.codecov.io/gh/NeurodataWithoutBorders/nwbinspector is not showing data.

[Bug]: CLI config error-catching

What happened?

Also observing very bizarre behavior: if actual errors (like a ValueError) are triggered within one of the check functions (see PR #106, which fixes the ones I'm seeing), they aren't caught by the try/except in run_checks and included in the report, but are instead raised to the console as true errors:

[screenshot: traceback printed to the console]

But if I don't include --config dandi, they behave as expected and go into the extra section of the final report.

Steps to Reproduce

From a console with working directory located in a folder that has dandiset 000004 (or ros3, alternatively)...


nwbinspector ./000004 --config dandi

and compare to...

nwbinspector ./000004

Traceback

Screenshot in the first section; the particulars of the traceback don't really matter. The point is that it IS a traceback pushed to the console as standard output instead of being captured by the report.

Operating System

Windows

Python Executable

Python

Python Version

3.9

Package Versions

versions.txt

Code of Conduct

  • I agree to follow this project's Code of Conduct
  • Have you ensured this bug was not already reported?
  • To the best of your ability, have you ensured this is a bug within the code that checks the NWBFile, rather than a bug in the NWBFile reader (e.g., PyNWB or MatNWB)?

[Feature]: specifiable output format

What would you like to see added to the NWBInspector?

Right now you can save the report file with any suffix, but what is truly written is a .txt file with .rst sectioning. It would be nice to allow multiple formats to be saved, such as .md, .doc, etc.

.md would be easy, as the main difference is simply the section headings: #/##/### instead of =/-/~.

For .doc, it is proposed to use https://pypi.org/project/rstdoc/; I tried it earlier today but didn't have any luck getting it to work through the CLI. It seems to use https://pypi.org/project/pandoc/ on the backend, so maybe tapping into that library directly would be the way to go.
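The .md case could be sketched as a simple heading translation, assuming the =/-/~ underline levels map to #/##/### as described above (a minimal illustration, not a full RST converter):

```python
UNDERLINE_TO_PREFIX = {"=": "#", "-": "##", "~": "###"}


def rst_headings_to_md(text):
    """Convert '=/-/~'-underlined RST headings to '#'-prefixed Markdown ones."""
    lines = text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        char = nxt[:1]
        is_underline = char in UNDERLINE_TO_PREFIX and nxt == char * len(nxt)
        if is_underline and len(nxt) >= len(line) > 0:
            out.append(f"{UNDERLINE_TO_PREFIX[char]} {line}")
            i += 2  # skip the underline itself
        else:
            out.append(line)
            i += 1
    return "\n".join(out)
```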

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

Add flag for DANDI intended upload

Optional flag or config for DANDI alteration of importance levels of certain checks.

Idea from Dorota: warn the user that output report may appear different.

By default just print/output to screen and not into a file

Description

During the recent NWB-DANDI dev hackathon demonstration of the nwbinspector, I forgot to ask: why does it save the log to a file instead of just printing it to the screen (which could then be redirected to a file if desired, or there could be an explicit --output-file FILENAME option)?

[Add Check]: cell and sample id

What would you like to see added to the NWBInspector?

@satra has requested checks for more complete metadata associated with intracellular electrophysiology patch clamp experiments, including checking for cell and sample ids.

Pseudocode or other logic of the check

No response

Do you have any interest in helping implement the check function?

Yes, but I would need guidance.

Code of Conduct

problem with compression test

This is showing up for many datasets:

BEST PRACTICE VIOLATION
-----------------------
1.1.1   NWBFile 'root' located in '/'
        check_dataset_compression: Consider enabling compression when writing a large dataset.

It's not clear which dataset is causing the violation.

[Feature]: conda-forge feedstock

What would you like to see added to the NWBInspector?

In light of nwbinspector becoming a dependency of dandi-cli (https://github.com/dandi/dandi-cli/), nwbinspector had better become available for conda distribution within the conda-forge channel, so that we can continue having the dandi client package there for the next release too.

@jwodder, since you have experience creating feedstock recipes, would you be so kind as to prepare one for nwbinspector?

Do you have any interest in helping implement the feature?

Yes.

Code of Conduct

Add "config file" support

Not yet sure about all the features such a config file should support, but it must allow specifying:

  • which checks for which paths to ignore

This is largely inspired by config file in bids-validator, see https://github.com/bids-standard/bids-validator#configuration

It is needed since some of these ignores are "per dataset", and thus it should be possible to specify what to ignore in a specific dataset (or dandiset, in DANDI land) instead of relying on the user to know which command-line options to provide.

Rename master to dev

For consistency with the other NWB repos it would be nice to rename the master branch to dev
