buzsaki-lab-to-nwb's Introduction

buzsaki-lab-to-nwb

NWB conversion scripts for popular datasets. A collaboration with the Buzsáki Lab.

Clone and dev install

$ git clone https://github.com/catalystneuro/buzsaki-lab-to-nwb
$ pip install -e buzsaki-lab-to-nwb

Alternatively, to clone the repository and set up a conda environment, do:

$ git clone https://github.com/catalystneuro/buzsaki-lab-to-nwb
$ cd buzsaki-lab-to-nwb
$ conda env create --file make_env.yml
$ conda activate buzsaki-lab-to-nwb-env
$ pip install .

Workflow

Here is a basic description of the standard conversion pipeline for this project.

Download local data

These datasets can sometimes be multiple TB in total size, so direct download to local devices for conversions is not recommended. Instead, we'll use a remote server for full runs of scripts developed and tested locally - more on that in the final steps.

For local debugging, it is recommended to download one random session from each subject of the dataset. You can find the globus endpoint here: https://buzsakilab.com/wp/ -> Databank -> Globus Datasets.

For example, the PetersenP dataset has subject subfolders MS{x} for x = [10, 12, 13, 14, 21, 22].

Each of these subject folders contains multiple sessions, of the form Peter_MS{x}_%y%m%d_%h%m%s, where the remaining portion is a datetime string giving the start time of each session. There may also be additional strings appended to the session name.
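When scripting over sessions, the start time can be parsed out of the folder name. The sketch below uses a hypothetical session name and assumes the suffix is a YYMMDD_HHMMSS timestamp:

from datetime import datetime

session_name = "Peter_MS10_170307_154746"  # hypothetical example
datetime_string = "_".join(session_name.split("_")[2:4])  # ignore any extra appended strings
session_start_time = datetime.strptime(datetime_string, "%y%m%d_%H%M%S")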

For prototyping, we would download a randomly chosen session for each subject. If none of these contain any raw data (.dat files), I would recommend specifically finding a session that does contain some so that it is included in the prototyping stage.

In some cases, the Globus dataset may include more subjects or sessions than were used in the corresponding publication. Start by checking the methods or supplementary section of the paper to see whether those details are included; if not, email the corresponding author to obtain the list of sessions used in the final analysis. A good example in the PetersenP dataset is MS14, which was not actually used in the publication even though data for it is available; we therefore skip this mouse when converting the dataset.

Build converter class

Using nwb-conversion-tools, construct an NWBConverter class that covers as many of the data types available in the dataset as possible. For example,

from nwb_conversion_tools import NWBConverter, PhySortingInterface

# PetersenNeuroscopeRecordingInterface and PetersenNeuroscopeLFPInterface are
# project-specific interfaces defined alongside this converter in the repo.
class PetersenNWBConverter(NWBConverter):
    """Primary conversion class for the PetersenP dataset."""

    data_interface_classes = dict(
        NeuroscopeRecording=PetersenNeuroscopeRecordingInterface,
        NeuroscopeLFP=PetersenNeuroscopeLFPInterface,
        PhySorting=PhySortingInterface,
    )

We will add more interfaces later as we develop custom ones for each experiment.

Build conversion script

Construct a script that instantiates the converter object and specifies any other dataset metadata that applies to each session. It is recommended to add parallelization at this stage as well. This script will be the primary way you run the conversion in the final steps.

These can honestly just be copied & pasted from previous conversions, such as https://github.com/catalystneuro/buzsaki-lab-to-nwb/blob/add_petersen/buzsaki_lab_to_nwb/petersen_code/convert_petersen.py, with all dataset-specific naming and descriptions updated to correspond to this particular conversion.

Be sure to keep stub_test=True as a conversion option throughout this step in order to debug as quickly as possible.
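For orientation, a minimal version of such a script might look like the sketch below. The paths, interface argument names, and metadata fields are placeholders rather than the actual PetersenP layout, but the overall source_data / get_metadata() / run_conversion() pattern follows nwb-conversion-tools and uses the PetersenNWBConverter defined above.

from pathlib import Path

# Hypothetical session location; the real script would loop over all sessions
# (optionally in parallel) and fill in dataset-specific metadata.
session_path = Path("/path/to/PetersenP/MS10/Peter_MS10_170307_154746")
session_name = session_path.name

source_data = dict(
    NeuroscopeRecording=dict(file_path=str(session_path / f"{session_name}.dat")),
    NeuroscopeLFP=dict(file_path=str(session_path / f"{session_name}.eeg")),
    PhySorting=dict(folder_path=str(session_path)),
)
converter = PetersenNWBConverter(source_data=source_data)

metadata = converter.get_metadata()
metadata["NWBFile"].update(session_description="...")  # dataset-specific details go here

converter.run_conversion(
    nwbfile_path=str(session_path / f"{session_name}.nwb"),
    metadata=metadata,
    conversion_options=dict(
        NeuroscopeRecording=dict(stub_test=True),  # keep True while debugging
        NeuroscopeLFP=dict(stub_test=True),
    ),
)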

Build specialized data interfaces for new data

This is where most of the time will be spent: designing a DataInterface class, in particular its run_conversion() method, for data not covered by the interfaces inherited from nwb-conversion-tools. This most often includes behavioral data such as trial events, states, and position tracking. Further, different datasets rarely use the same exact method of storing these data, so I/O has to be developed from scratch for each new file format.
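As a rough template (not the actual PetersenP behavior format), a custom interface typically subclasses BaseDataInterface and fills the in-memory NWBFile inside run_conversion(). The file layout, loader, and class name below are hypothetical, and the exact import path for BaseDataInterface may differ between nwb-conversion-tools versions.

from pathlib import Path

import numpy as np
from pynwb import NWBFile
from nwb_conversion_tools import BaseDataInterface

class HypotheticalTrialsInterface(BaseDataInterface):
    """Illustrative interface for trial start/stop times stored in a .npy file."""

    def __init__(self, file_path: str):
        super().__init__(file_path=file_path)
        self.file_path = Path(file_path)

    def run_conversion(self, nwbfile: NWBFile, metadata: dict, stub_test: bool = False):
        # Hypothetical layout: an (n_trials, 2) array of [start_time, stop_time] in seconds.
        trial_times = np.load(self.file_path)
        if stub_test:
            trial_times = trial_times[:10]  # only write a few trials while debugging
        for start_time, stop_time in trial_times:
            nwbfile.add_trial(start_time=float(start_time), stop_time=float(stop_time))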

Remote server

Now it's time to download all the available data from the endpoint onto the remote server for conversion. It's critical that this data goes onto the mounted drive at /mnt/scrap/catalystneuro.

When download is complete, try to run the full conversion with stub_test still set to True. Occasionally certain bugs only show up during this stage as they may correspond only to a handful of sessions in the dataset.

When all tests are passing with stub_test=True, investigate some of the NWB files with widgets or other viewers to ensure everything looks OK. Post some on the Slack as well so Ben and Cody can approve.
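For a quick look in a Jupyter notebook, something like the following is usually enough (the file path is a placeholder):

from pynwb import NWBHDF5IO
from nwbwidgets import nwb2widget

# Open one of the converted stub files read-only and browse its contents interactively.
io = NWBHDF5IO("/path/to/converted_session.nwb", mode="r")
nwbfile = io.read()
nwb2widget(nwbfile)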

Once approved, set stub_test=False and begin the full conversion with parallelization options set to a maximum of 12 cores and ?? (I need to check) RAM buffer per job.

After it is complete, double check the NWB files with widgets to make sure the full conversion went as expected; if everything looks good, create a new DANDI set, fill in requisite metadata, and proceed with upload to the archive.

buzsaki-lab-to-nwb's People

Contributors

bendichter, codycbakerphd, dependabot[bot], garrettmflynn, h-mayorquin, luiztauffer

buzsaki-lab-to-nwb's Issues

Update contents of `make_env.yml` to match `requirements.txt`

The contents of the file make_env.yml seem to be outdated and out of sync with requirements.txt:

make_env.yml

name: convert_to_nwb
channels:
- defaults
- anaconda
- conda-forge
dependencies:
- python=3.7
- ipython
- pip
- numpy
- scipy
- jupyter
- matplotlib
- cycler
- h5py
- pynwb
- hdmf
- pip:
  - ndx-grayscalevolume
  - git+https://github.com/NeurodataWithoutBorders/nwbn-conversion-tools
  - -e .

requirements.txt

pynwb==1.4.0
tqdm==4.60.0
numpy==1.19.3
pandas==1.2.3
scipy==1.4.1
h5py==2.10.0
hdf5storage==0.1.18
PyYAML==5.4
jsonschema==3.2.0
psutil==5.8.0
lxml==4.6.3
spikeextractors==0.9.7
spikesorters==0.4.4
spiketoolkit==0.7.5
neo==0.9.0
roiextractors==0.3.1
nwb-conversion-tools==0.9.3
mat4py==0.5.0
mat73==0.52

Is there a reason for this mismatch?
I intend to update the make_env.yml file so that it matches, as I have been using it on DANDI to run our conversion.

Can I move forward with this, @CodyCBakerPhD?
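One possible way to keep the two files in sync (a sketch of the idea, not an agreed-upon fix) would be to have make_env.yml delegate its pip dependencies to requirements.txt:

name: buzsaki-lab-to-nwb-env
channels:
- conda-forge
- defaults
dependencies:
- python=3.7
- pip
- pip:
  - -r requirements.txt
  - -e .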

First ValeroM conversion

Past project, data collected and paper published.

Paper: Probing subthreshold dynamics of hippocampal neurons by pulsed optogenetics

Data: Buzsaki Globus, ValeroM folder

Sessions
datasets/ValeroM/fCamk1/fCamk1_200827_sess9
datasets/ValeroM/fCamk1/fCamk1_200901_sess12
datasets/ValeroM/unindexedSubjetcs/fCamk1/fCamk1_200902_sess13
datasets/ValeroM/unindexedSubjetcs/fCamk1/fCamk1_200904_sess15
datasets/ValeroM/unindexedSubjetcs/fCamk1/fCamk1_200908_sess16
datasets/ValeroM/unindexedSubjetcs/fCamk1/fCamk1_200909_sess17
datasets/ValeroM/unindexedSubjetcs/fCamk1/fCamk1_200910_sess18
datasets/ValeroM/unindexedSubjetcs/fCamk1/fCamk1_200911_sess19
datasets/ValeroM/unindexedSubjetcs/fCamk3/fCamk3_201028_sess10_cleanned
datasets/ValeroM/unindexedSubjetcs/fCamk3/fCamk3_201029_sess11_cleanned
datasets/ValeroM/unindexedSubjetcs/fCamk3/fCamk3_201030_sess12
datasets/ValeroM/unindexedSubjetcs/fCamk3/fCamk3_201103_sess14
datasets/ValeroM/unindexedSubjetcs/fCamk3/fCamk3_201111_sess20
datasets/ValeroM/unindexedSubjetcs/fCamk3/fCamk3_201113_sess22
datasets/ValeroM/unindexedSubjetcs/fCamk3/fCamk3_201105_sess16
datasets/ValeroM/unindexedSubjetcs/fCamk3/fCamk3_201106_sess17
datasets/ValeroM/unindexedSubjetcs/fCamk3/fCamk3_201110_sess19
datasets/ValeroM/unindexedSubjetcs/fCamk3/fCamk3_201109_sess18
datasets/ValeroM/unindexedSubjetcs/fCamk5/fCamk5_210406_sess10
datasets/ValeroM/unindexedSubjetcs/fCamk5/fCamk5_210407_sess11
datasets/ValeroM/unindexedSubjetcs/fCamk5/fCamk5_210408_sess12
datasets/ValeroM/unindexedSubjetcs/fCamk5/fCamk5_210414_sess16
datasets/ValeroM/unindexedSubjetcs/fCamk5/fCamk5_210412_sess14
datasets/ValeroM/unindexedSubjetcs/fCamk5/fCamk5_210415_sess17
datasets/ValeroM/unindexedSubjetcs/fCamk2/fCamk2_211012_sess1
datasets/ValeroM/unindexedSubjetcs/fCamk2/fCamk2_211013_sess2
datasets/ValeroM/unindexedSubjetcs/fCamk2/fCamk2_201014_sess3
datasets/ValeroM/unindexedSubjetcs/fCamk2/fCamk2_201015_sess4

Other info:

The data from the published paper come from 4 CamKIIa::Ai32 mice that were implanted with uLED probes (32 channels with 12 miniaturized LEDs) and recorded both in their home cages and on a linear maze. A unique aspect of the dataset is the pattern of light stimulation that we used (short, low-intensity pulses on randomly assigned LEDs during ~3 h per session). This is specified in two .mat files named sessionName.pulses.events.mat. Let me know if you have any questions about this.
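As a starting point for prototyping the I/O, one might inspect these files with something like the sketch below; the session path is a placeholder, and the choice of loader is an assumption (buzcode/CellExplorer .mat files may be either pre-v7.3, readable with scipy, or v7.3/HDF5, readable with mat73 from requirements.txt):

from scipy.io import loadmat
import mat73

pulses_path = "/path/to/fCamk1_200827_sess9/fCamk1_200827_sess9.pulses.events.mat"

try:
    pulses = loadmat(pulses_path, squeeze_me=True)  # pre-v7.3 MATLAB files
except NotImplementedError:
    pulses = mat73.loadmat(pulses_path)  # v7.3 (HDF5-based) MATLAB files

print(pulses.keys())  # inspect the structure before designing the DataInterface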

neuroscope helper function data flow

Though what is currently used works fine as-is, the neuroscope.py helper functions do a decent amount of direct data retrieval from source files such as the .xml, obtaining data that has already been retrieved in the same fashion and stored upstream in the converter's metadata.

Such occurrences are computationally redundant; the helper functions should instead receive these values as inputs passed from the metadata, which should gather all relevant values prior to running the conversion.

port buzsaki conversion code

The Buzsaki lab currently uses the Neuroscope format to store their raw data. The code to convert their data is here: https://github.com/ben-dichter-consulting/to_nwb/tree/master/to_nwb/Buzsaki, mostly in convert_yuta.py. I'd like to move this over to this repo. We should use the Neuroscope extractor where we can, but we will also need to port much of the custom code for behavior. We should also use the NWB conversion GUI to allow the user to manually manage the metadata.
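For reference, reading the raw Neuroscope data through spikeextractors looks roughly like the sketch below; the file path is a placeholder and the exact constructor arguments may differ between spikeextractors versions:

from spikeextractors import NeuroscopeRecordingExtractor

# The .xml sidecar next to the .dat supplies the sampling rate and channel layout.
recording = NeuroscopeRecordingExtractor(file_path="/path/to/session/session.dat")
print(recording.get_num_channels(), recording.get_sampling_frequency())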

neuroscope data sharing with SortingExtractor

There are occasional references throughout the neuroscope helper functions to data contained within the .clu and .res files.

This information has already been accessed, retrieved, and stored within the SortingExtractor object prior to running the conversion, so these helper functions should use that object rather than re-accessing the files, since that access can only be done via readtxt (not lazily).

  • One exception may be waveforms in the .spk.%i files, written as SpikeEventSeries whose indexing corresponds to the per-shank (not per-unit) .res files; of course, once the planned support for waveforms as jagged-array unit columns is implemented, we can include that in SpikeInterface and this would become a non-issue.
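Concretely, anything that currently re-parses the .clu/.res files can usually be replaced with the standard SortingExtractor calls, along the lines of the sketch below (the helper name is illustrative):

def unit_spike_times(sorting):
    """Return spike times in seconds per unit from a spikeextractors SortingExtractor."""
    sampling_frequency = sorting.get_sampling_frequency()
    return {
        unit_id: sorting.get_unit_spike_train(unit_id=unit_id) / sampling_frequency
        for unit_id in sorting.get_unit_ids()
    }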

replace n_channels value retrieval

Currently, many situations that require the total number of channels in the recording (.dat or .eeg) retrieve the value via procedures analogous to the one below:

n_channels = len([
    [int(channel.text) for channel in group.find('channels')]
    for group in root.find('spikeDetection').find('channelGroups').findall('group')
])

whereas this parameter is also stored directly in the same .xml file the information is drawn from and should be retrieved with that O(1) lookup instead. It will simply take a little time to go through this repository, as well as to_nwb.neuroscope, to find them all.
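For instance, the standard Neuroscope .xml stores the total channel count under acquisitionSystem/nChannels, so the lookup can be as simple as the sketch below (the path is a placeholder):

from xml.etree import ElementTree

root = ElementTree.parse("/path/to/session/session.xml").getroot()
n_channels = int(root.find("acquisitionSystem").find("nChannels").text)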

setup.py needs to be checked

Needs to be tested within fresh Python environments to ensure successful installation of dependencies. Will check on this once the other PRs are merged, to guarantee the correct list of dependencies.

First HuszarR conversion

Past project, data collected and paper published.

Paper: Preconfigured dynamics in the hippocampus are guided by embryonic birthdate and rate of neurogenesis

Data: Buzsaki Globus, HuszarR folder, everything under optotagCA1

Other info:

The data was all recorded from mouse CA1. These are long recordings; many sessions have behavior on a familiar T-shaped maze, and there are also animals that were exposed to both familiar and novel environments.
Some of the sessions were recorded with dual-sided probes from Diagnostic Biochips, which gave us really good cell yield. One session has up to 480 cells, which is really great for ipsilateral CA1.
Sessions with these high cell yields (we also have more that haven't been spike sorted) might be especially useful for the wider community.

Another unique aspect of the dataset is that it includes information about the embryonic birthdate of pyramidal neurons, which (as it turns out) constrains the activity patterns across brain states (the paper is all about this).
The data is organized according to the embryonic birthdate of the cells, e.g., the folder "e15" holds all animals where neurons of embryonic day 15 were targeted. There were 6 such animals (e.g., "e15_13f1"), and the folder of each holds the recording sessions that contributed to the analysis in the paper.
In the supplementary information of the paper (table 1, also attached), I describe what manipulations were done for each animal; it is not always the same.

In my case, all the data is in the folder HuszarR/optotagCA1, and it is primarily CellExplorer and buzcode-related formatting.
I did not include any LFP data, although the ripple files include a single channel of LFP from the middle of the pyramidal layer. This works well for detection of events, but for finer timing-related LFP analysis (e.g., phase of hippocampal theta), this might be insufficient.
I also did not include any of the raw data in these folders.

I am talking to them about getting the raw data included in this conversion.
