datajoint / datajoint-python Goto Github PK

Relational data pipelines for the science lab

License: GNU Lesser General Public License v2.1

Python 99.90% Dockerfile 0.10%

datajoint scientific-computing databases data-analysis pipeline-framework python relational-databases relational-algebra relational-model mysql

datajoint-python's Introduction

Welcome to DataJoint for Python!

DataJoint for Python is a framework for scientific workflow management based on relational principles. DataJoint is built on the foundation of the relational data model and prescribes a consistent method for organizing, populating, computing, and querying data.

DataJoint was initially developed in 2009 by Dimitri Yatsenko in Andreas Tolias' Lab at Baylor College of Medicine for the distributed processing and management of large volumes of data streaming from regular experiments. Starting in 2011, DataJoint has been available as an open-source project adopted by other labs and improved through contributions from several developers. Presently, the primary developer of DataJoint open-source software is the company DataJoint (https://datajoint.com).

Data Pipeline Example

Yatsenko et al., bioRxiv 2021

Getting Started

Install with Conda
```
conda install -c conda-forge datajoint
```
Install with pip
```
pip install datajoint
```
Documentation & Tutorials
Interactive Tutorials on GitHub Codespaces
DataJoint Elements - Catalog of example pipelines for neuroscience experiments
Contribute
- Development Environment
- Guidelines

datajoint-python's People

Contributors

Stargazers

Watchers

datajoint-python's Issues

Should we think of a reasonable way to overload the getitem method of base?

It would be nice if we could use that for fetching.

adapt fetch and insert

to use namedtuples
drop support of positional insert
insert with kwargs

add the possibility to have autoincrement on ids

insert

re-implement insert

Increase test coverage >68%

Because good software needs good tests

Separation of `declare` logic

As I worked through the massive changes in the logic introduced via the last merge, I noticed that the new declaration logic doesn't work well with the previous logic used in resolving module references (e.g. foreign key references) in the table definition string.

When a table definition refers to another table in another schema as in the following case

definition = """
A.Subject (manual) # subjects defined in schema A
-> B.Setups
"""

Then this reference to module B has to be resolved appropriately. Unless B is a module that's known by that very name to the Python interpreter (that is, you actually have B.py on the top level, and not inside any package), then B is not descriptive enough to be resolved down to a module that holds the definition of the target table B.Setups. To get around this problem, I have previously introduced three-step resolution strategy that utilizes the information about the module that holds this definition (in this case, the module A). The steps are as follows:

Within the module A, look to see if there is a module imported by the name B. This means that if inside the module A I have a statement like import v1_project.schemata.B as B (and therefore there is a local variable B that is a module object) then the module name B in the table definition string will resolve to the module v1_project.schemata.B, and looks for the table Setups in there.
If no such imports are found and if A actually resides in a package (so that A is actually package.A), then look for a module named B in the same immediate parent package. Thus it'll check for package.B and if found, uses that.
If neither of above 2 approaches works, then assume that B is actually a globally accessible module name, and attempt to import the module. From an organization and distribution point of view, I think it is simply unreasonable to assume that all modules with table definitions to be a top level module.

The point here is that, the resolution of any reference to a module that is not the module containing the table definition (i.e. -> B.Setups), requires the information provided by the module holding the definition (and the package it belongs to).

An alternative is of course to make the module reference in the definition explicit, such that in the above case, rather than -> B.Setups, it would read v1_project.schemata.B.Setups instead. The problem with this is that, not only now it is wordy and tedious to write, but it binds the location of the module to a strict package structure. The beauty of the previous implementation was that you can do relative reference to other schema (module), with an ability to make the target explicit via import, if preferred.

The move of making table declaration into a separate function really breaks the above logic, because it assumes that the table definition can exist independently from the module in which they are found. Unlike in Matlab where you can pull up any class name up to the top level by adding it to the path, Python modules inside a package really requires the package names to reach the module, and as far as I know there is no easy way to make a module in a package accessible at the top namespace throughout the Python interpreter, and even if there is, I'd think it'll be rather awkward.

Given these, I suggest that we place the declare and related functionality back into the Base class. I could of course make the declare function take in the name of the module that holds the definition, but if the most common use of declare is by Base derivatives, and thus pretty much always need information about the module in which the derived class is defined, I really don't see a whole lot of benefit in separating declare into an independent function.

Implement tests

Implement organized testing scheme. Thinking about using nose.

Collection of schema design patterns in docs

We should have this. Other ideas on the topic are:

give users the opportunity to submit new design patterns
write a small website on which users can generate schemata and download the code

Are we planning to support Python 2.7?

In state the code is currently in, we only support Python 3 (e.g. because we use "nonlocal" in our class design).

Do you think it is worth to support python 2? In case of the nonlocal, for example, there is no simple solution (see this for more details).

An nice start into bilingual python can be found here.

Generate documentation

Generate documentation forDataJoint semi-automatically from in-code documentations. Generated documentations should be hosted on datajoint-python project site.

implement fetch, including blobs with MATLAB arrays

Accepting user input for connection information

I noticed that in connection.py conn_container defintion, the part

input('Enter datajoint server address >> ')

was commented out. I'm guessing this was a done when configuring TravisCI, but is this necessary anymore?

Submit datajoint to PyPI

Once all the new features are added, I think datajoint should be submitted to PyPi, so it can be installed by just typing pip install datajoint.

More information on how to submit a package to the PyPi repostory can be found here.

This will also require us to think about how we are going to version the python interface of datajoint. I guess it would be the easiest to just use the same major and minor version number as in matlab to show SQL data structure compatibility.

load dependencies

Complete the implementation of the dependency loading mechanism in the connection object.

Provide documentation for the code

Now documentation generation system is in place, we should go and massively add documentations for all public interfaces of the DataJoint.

simple autopopulate (no job reservations)

port the latest implementation of the automatic table population from the MATLAB version.

ERD: entity relationship diagram

This was previously implemented using the networkx module. Make this work with the new implementation, probably as a method of Connection.

Data definition functionality

Implement data definition functionality (i.e. insert, drop table, etc) into the Base class.

Support table declaration from a database bound module inside a package

Current implementation doesn't support creating table from the doc-string in a module defined inside a package because the module full name is of the form package.subpackage.module whereas first line of declaration is expected to be modue.classname.

Should we rename `iter` in primary_iter?

I think the name would fit better. I would then use __iter__ for iterating through all tuples of a relation and return them as tuples. This would have the advantage to be able to use for loops like

for animal_id, date_of_birth, genetic_line in mice_relation:
   ...

What do you think?

cascading delete

see dj.Relvar/del on the MATLAB side.

Should we have a switch in settings that makes the library prompt for confirmation before deleting?

Primary key attributes should respect order in which they are specified

implement table declaration

Table definitions in given in Relvar class doc strings. Port the latest table declaration parser. If a relvar object cannot find its table in the database, it shall declare it upon first query to the table.

Direct database reference in definition

Should we allow the table definition to contain reference to other databases directly via the database name rather than the module name? For example, should the following definition be allowed in Base derivatives

definition = """
schema1.Subjects (manual)   # list of subjects
-> `database2`.Experimenter

where database2 is the actual name of the database under which a table named experimenter actually exists. Such referencing style is currently allowed if you directly instantiate a Table object, passing in a definition string to the constructor.

I was thinking that since we expect all dj.Base derivatives to reference each other via module.class naming conventions, it would make sense to actually prohibit direct reference to a database from within definitions for dj.Base derivatives.

Using `pandas` data structures in core implementation

Some handling of data requires us to perform operations like project and join on the fetched data structure with a data structure passed in by the user. It appears like our current go-to data structure is the numpy record array, with no such methods available (at least by default). On Matlab side, join is also no available by default but provided by datajoint as dj.struct.join. We could do this for record array but pandas package already provides DataFrame object with all such methods implemented. I believe that pandas is a pretty standard data analysis package in Python numerical computing, so may be it wouldn't be a bad idea for us to use pandas and it's data structures (DataFrame and perhaps Series) directly in our implementation.

cascading table drop

Make datajoint.Base.drop cascade.

see dj.Table/drop on the matlab side for reference.

Increase test coverage >90%

At the moment we are at 47%, which is clearly unacceptable.

add verification for set_table_comment in Base

Rewritte settings

default dictionary in datajoint.settings
dictionary derived class in datajoint.settings that has additional validators
implement validators
local file should store the local configuration. For security reasons that should not be a python module, but a json configuration file. If the json file is not there, it is created from default. An environmental variable can point to different location than local directory.
datajoint.init.py instaniates dictionary derived class as config from local config file.

blobs for arbitrary objects

Currently, blobs can only store n-dimensional numerical arrays. The data serialization is done using the mym protocol to keep the data compatible with the matlab side of things. Mym supports all MATLAB data structures: structures, structure arrays, cells, objects, etc. In practice, we only store numerical arrays in blobs. Everything else is usually normalized into their own attributes. So I am okay with deviation from the mym protocol for serializing other objects than numerical arrays. This is low priority for now since this feature is not in strong demand.

Referencing tables in own module can skip module name

Let reference to other tables in the same module (and thus schema) should allow for the module name to be skipped in the foreign reference definition. So

->own_module.TableB

can be specified simply as

->TableB

Fails to handle dependency on an existing table without a class

Current implementation will fail if a newly defined table has a dependency on a table that already exists in a schema bound to a module, but does not have a corresponding class definition in the module. This is a rather convoluted, but certainly possible situation that must be addressed.

For a concrete example, if you are defining a new table called Experiment in module mouse that refers to another existing table called microscopes in the database mouse_setups that is already bound to a module called setups, as here:

definition = """
mouse.Experiment (manual)
-> setups.Microscopes

then this will fail to create an appropriate table relation representing setups.Microscopes.

Naming conventions

Names of methods, functions and variables are currently largely in camel cases, and thus not adhering to the Python's style guide PEP 8 which advocates use of underscores in function and variable names. I think we should refactor these names to make the code more Pythonic.

AutoPopulate should prohibit inserts from outside a populate() call

automatically populated tables should never be inserted into from outside a populate call. Calling populate should enable insert. Exiting populate should prohibit insert.

Should we provide exporting non-blob data to pandas?

colons : in enum values cause declaration errors

Just as in Matlab implementation, having colon (:) in enum values causes declaration errors, but only so when default value is given for the field.

simple delete

restricted delete without cascading

ERD plotting support

The logic for creating and manipulating graphs that represent the ERD is pretty much there and can be found in the erd branch. I will keep this branch as the feature implementation branch. There are a few issues that I'm running into as I try implementing plotting support:

pygraphviz library that was once used in our project for plotting ERD does not support Python 3 natively yet. Explicitly using their beta branch with pip install pygraphviz==1.3rc2 allows us to use it in Python 3, but I'm not sure if we would like to introduce a dependency on beta-version of the package, especially as this is a very specific dependency.
pydot is an alternative package that may be used, but again there is no official Python 3 support. There appears to be a port pydot3k

The above two were the preferred method to work with graphviz graph plotting engine, that yields very nicely arranged graphs. Alternatively, I could try working with graphviz directly, but to my knowledge, the library can only render files like .pdf files at the end and it takes some effort to display this into matplotlib.

Create schemas automatically

If Connection.bind does not find the specified database on the server, it should create it automatically. This will be the user's way of creating schemas.

Also, how should we drop schemas? Maybe dropping all tables should trigger the dropping of the database.

Define default pop_rel

In 90% of cases, the pop_rel is the unrestricted join of the primary dependencies of the table. Should we set that as the default value and let users override that?

Merge Base.declaration and Base._table_def

Display relation contents

dj.Relational.__str__ should display a representation of the contents of the relation

Support insertion of blob data

AutoPopulate as an abstract subclass of Base

Implementation of features for AutoPopulate depends on the implementation of the Base. In particular, a class that derives from AutoPopulate (which is an abstract class now), will function only if:

it implements all abstract propoerties (popRel) and abstract method (makeTuples) - these two points are fine, although I'd like to rename them to fit Python naming convention better
the subclass must actually be a Base derivative to function correctly - basically we expect the subclass to inherit from Base and AutoPopulate simultaneously. This I think is conceptually messy.

Since we really don't expect the subclass to not be a Base derivative for the AutoPopulate subclass to function, I think it'd make more sense to simply make AutoPopulate itself an abstract subclass of the Base class. This way, when one wants to implement a table with populate functionality, the class has to only inherit from AutoPopulate.