vivarium-collective / vivarium-core Goto Github PK

View Code? Open in Web Editor NEW

20.0 5.0 2.0 10.21 MB

Core Interface and Engine for Vivarium

Home Page: https://vivarium-core.readthedocs.io/

License: Apache License 2.0

Python 51.59% Shell 0.15% Jupyter Notebook 48.26%

multi-scale engine vivarium

vivarium-core's People

Contributors

Stargazers

Watchers

Forkers

alexandersimws thalassemia

vivarium-core's Issues

move fasta library to vivarium-cell

The fasta library, currently still in vivarium-core, is used to load for genome sequence data, which is only used in vivarium-cell. This should be moved, and all imports in vivarium-cell updated.

replace all uses of distutils

Python 3.10 deprecates distutils. It'll be removed in Python 3.12.

So with the move to support Python 3.10 in #135 , we should replace distutils in all vivarium projects. setuptools embeds a forked (and updated?) copy of distutils but there are compatibility issues and it's moving away from being a CLI tool to being just a library. setup.py is not dead but apparently we should not run it directly. It's a mess.

(Presumably wcEcoli will stay on Python 3.8.7 and distutils. I did try updating it to setuptools but reverted that due to CovertLab/wcEcoli#1113.)

PEP 517 defines how to describe a project in a small, declarative pyproject.toml file.

Modern tools to consider:

Flit. Experts recommend Flit and it should be easier and more robust to use and maintain. Flit handles the normal cases but Flit does not support Cython code so it won't work for vivarium-ecoli. It makes sense for us to learn one modern tool and use it in all vivarium projects.
build.
Pipenv (there's also a Pipenv guide on Real Python)
Poetry
Bento
Meson is a build tool like pyinvoke and Ruby rake, not a packaging tool, but apparently Numpy & Scipy are moving to [Meson]. Maybe it includes Python packaging features? Anyway, don't switch to numpy.distutils.
...

Some of these tools can also obviate the Makefile.

Schema validation fails with Numpy arrays

The equality comparisons here and here throw the following error when any values are Numpy arrays:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

For example, this occurs when the new and current schemas are {'_divider': {'divider': 'set_value', 'config': {'value': np.zeros(10)}}}. The above Exception is raised due to the np.zeros(10) buried deep within the _divider schema. A custom recursive function is needed to check all schema values for embedded Numpy arrays and compare them separately using np.array_equal.

Handling Multi-Updates

We lack robust support for processes with more than one port to the same store.

The multi-update workaround introduced in #28 allows ports that point to multiple leaf stores to share leaf stores. In this case, updates to the shared leaf stores are aggregated into lists and applied sequentially.

Unfortunately, this workaround is very fragile. Here is an non-exhaustive list of scenarios that break this solution:

The process defines an initial_state method that provides initial values for the overlapping ports (patched in #205)
The overlapping stores contain values of type dict
The topology for one or more of the the overlapping ports is a dict instead of a tuple (e.g. port: {'_path': ('bulk',)})
One or more of the overlapping ports points directly to a single leaf store instead of multiple

One potential solution is to generate individual updates for each port and apply them in sequence.

Make instance methods into class methods or static methods where possible

Create CONTRIBUTING.md

We should create a CONTRIBUTING.md file that lays out:

How to start contributing:
- Sign CLA (or say you don't need to sign the CLA)
- [Optional] Open an issue
- Get assigned to an issue by core team
- Fork repository and make your changes
- Open a PR
- Await approval from core team
How to get in touch with us
- Private contact email for security / harassment issues
- For questions / discussions, open discussion
- For feature requests / bug reports, file an issue
"Code of conduct" which may be much shorter / less formal than e.g. the contributor covenant

Raise exception for conflicting dividers

Currently, when a user specifies two different dividers for a variable (e.g. through two different processes wired to that variable), we silently pick on divider:

vivarium-core/vivarium/core/store.py

Line 578 in 10c756c

self.divider = config['_divider']

We should instead raise an exception

Use the Apache License

Proposal: Multiple Production Branches

In #192 it became clear that it would be nice for us to be able to create v2 releases that break v1 functionality without pushing those releases out to all users and breaking their code. Here's how we could do this:

Git Workflow

Create a develop-v1 branch from the current master. This is where we'll put any bug fixes needed for the v1 version of Vivarium Core
Our development moving forward would be on develop-v2.
Like we do currently, we'll use a git tag to label each release.

Versioning and PyPI Workflow

Our v2 releases will at first be pre-releases as defined by PEP-0440. For example, 2.0.0a1 for the first alpha release or 2.0.0b3 for the third beta release. pip should not install any of these pre-releases when a user runs pip install vivarium-core.
- PEP-0440 describes some alternative ways to format the version numbers. For example, instead of 2.0.0a1, we could use 2.0.0-alpha1, 2.0.0-alpha.1, 2.0.0-alpha-1, etc.

Questions to Answer

What do we want to call our main development branch (i.e. what we currently call master)? There's a certain symmetry to calling it develop-v2, but then the name of our main branch would change each time we start a new major release (e.g. when we start developing v3).
How should we format our version numbers (examples above, more details in the PEP)?
When do we want to use alpha and beta releases? Should we just pick one? Note that by convention, alpha releases are for internal testing, while beta releases are for testing by outside users. We may not need that level of granularity, however.

some library design decisions before publishing the Vivarium paper

Some questions to address before publishing the Vivarium paper and inviting library users we can't consult before making incompatible changes:

Are there any unused functions, methods, constants, classes to remove?
Finalize which methods are class methods.
- Superseded by #140
Which definitions should be private? Use __all__ or _xyz names to declare that.
Which modules should be private?
- No modules need to be private
Which Python versions to officially support? 3.6 - 3.9? Declare that in setup.py.
- Done in #135.
Use abc abstract base classes & methods anywhere else?
Deprecate the Deriver feature and use only Composites, keep it, or otherwise unify them?
The library has 130 isinstance() checks! isinstance() should be used sparingly since it makes for fragile code.
- Superseded by #139
Can we move the generated doc files out of git?

Prevent users from mutating parameters used for serialization

We use Process.parameters to serialize processes, which is not reliable when processes mutate their parameters. To prevent this, we should save off a copy of the parameters when the process is initialized and use this copy for serialization

Publish v1.6.1

Can a maintainer publish v1.6.1 on PyPI? Thanks!

Migration to vivarium-collective

Steps to migrate from vivarium to vivarium-collective:

vivarium-core requires pytest

There are a couple uses of pytest in the codebase. It should be added to setup or these uses should be removed:

vivarium-core/vivarium/core/engine.py

Line 24 in aae63ff

import pytest
vivarium-core/vivarium/experiments/store_api.py

Line 1 in aae63ff

import pytest

separate file for toy processes, compartments

There are several Toy processes and compartments in the codebase for testing purposes. Some of these are in composition.py, some in the processes directory (exchange_a.py), some for tests within other files (ToyAgent in meta_division process file). Maybe these can all be pulled into a separate file so that they can be more easily accessed and reused?

deserialize units

Currently Quantity instances with units are serialized with value.magnitude, which strips them of their units. deserialize is not being used in emitter.get_data, and so we lose information about units and this can be a problem. The serializer will need to know which units to convert a value to when serializing and deserializing.

serialize.py: np.float128 and np.complex256 are not implemented on some platforms

np.float128 and np.complex256 do not exist for numpy on some platforms (including Windows, ARM-based MacOS, which breaks scripts when trying to import them.

Deleting np.float128 and np.complex256 on lines 191, 192 in serialize.py fixed the issue for me.

vivarium-core/vivarium/core/serialize.py

Line 191 in aae63ff

np.float16, np.float32, np.float64, np.float128,

A couple of other examples:
mpi4jax/mpi4jax#112
winpython/winpython#613

Decide on Policy for Breaking API Changes

Before we release v1.0.0, we should document a policy for breaking API changes and versioning. I recommend we follow semantic versioning (which we mostly have been already) with the following further policies:

Our supported API consists of all public, documented interfaces that are not marked as being experimental in their docstrings.
Changes to the supported API will be reflected in the versioning according to semantic versioning. They will also be marked as deprecated in the documentation (and a warning will be raised if possible) for at least one minor release before they are removed.
Experimental interfaces may be changed or removed at any time and in any version.
The following are not considered breaking API changes:
- Adding to a function a parameter with a default value (i.e. an optional parameter)
- Adding a new key to a dictionary. For example, expanding the Composite dictionary with an extra key.
- When a function accepts a dictionary as an argument, adding more, optional keys to that dictionary. For example, letting the user specify new configuration options through a config dictionary.

Pull Schema, Topology, and Hierarchy Utilities into Separate Library

Scattered throughout Vivarium, we have functions to perform operations on schemas, topologies, and hierarchies, e.g. inverse_topology. I think the codebase would be easier to understand if we pulled these functions into a new library under vivarium/library so that we can document and test them thoroughly.

In this library, we would have 3 kinds of entities:

Hierarchies are trees of data.
Schemas assign metadata to each node in the tree.
Topologies are like MongoDB projections. They describe how to retrieve data from a hierarchy and re-shape it into a desired structure.

The library would provide functions to do the following:

Reach into a hierarchy and retrieve values in the form specified by a topology.
Invert a topology (this function takes in one topology and returns a new topology).
Given a set of schemas and a set of topologies, create a hierarchy. This function should take parameters to describe how to initialize the hierarchy values from the schema rather than hard-coding a special key like _default. This also shouldn't be dependent on the Store class.
Take a hierarchy and figure out what schemas need to change to effect a change in that hierarchy. This function will also need to take the schemas as an argument.
Perform operations (e.g. updates) over a hierarchy based on a set of schemas.

Possibly related: #100

Proposal: Improved Handling of Parallel Processes

Current Approach

Message Passing

When a process has Process.parallel set to true, we launch a parallel OS process that runs vivarium.core.process._run_update(). run_update() contains a loop that repeatedly receives (interval, state) tuples from a pipe, passes those arguments to Process.next_update(), and sends the result back through the pipe. To stop the parallel process, we pass interval=-1.

Tracking in Main OS Process

In the main OS process (which contains Engine), we store ParallelProcess instances in Engine.parallel. Then when we need an update from a process, we call Engine._invoke_process, which does the following:

If the process is parallel, starts the ParallelProcess computing the update and returns it.
If the process is non-parallel, creates and returns an instance of InvokeProcess, which has an interface similar to ParallelProcess but computes the update immediately. Note that there's an extra invoke_process function that computes the process's update, but this extra level of indirection appears unnecessary.

Problems

The way we currently track parallel processes has a number of downsides:

We store ParallelProcess instances in Engine.parallel, but we also store the original Process instances in the store hierarchy (with a reference in self.processes).
1. This is unnecessary and wastes memory, especially for processes with large parameter dictionaries or internal states.
2. Having the original Process instances in the store hierarchy is confusing. A user can read out the internal state of those processes with no problem, but they're getting the state from a process that hasn't changed since the beginning of the simulation, which is not intuitive.
There's no way to retrieve any information from a process besides its next update. This is a problem when we want to read out a process's internal state (e.g. for debugging or when working with inner simulations).
The extra levels of indirection are confusing. Every time I work on this, I have to trace through the code again to remind myself what all the "invoke" things do.

Proposed Solution

Eliminate the extra invoke_process function that doesn't appear to do anything.
Instead of storing ParallelProcess instances in self.parallel, put them directly into the store hierarchy with references in self.processes. Once a parallel process has been put on its own OS process, there should be no copies of it left in the main OS process.
Systematize message-passing between the main and parallel OS processes as commands with arguments. Process will have a run_command() method that accepts a command name (string) and a tuple of arguments. The _run_update() function would handle the following commands:
1. _halt: Ignores arguments and shuts down the parallel process (equivalent of passing interval=-1 currently)
2. _next_update: Takes (interval, state) as arguments and passes them to Process.next_update()
Process authors can override run_command() to handle more commands, e.g. to return internal state variables.

ParallelProcess will also provide run_command(), but instead of running commands itself, it will pass those commands through the pipe to its child process's _run_update() function, which will in turn pass them to Process.run_command().

I've started implementing this in #198

I think this proposal addresses all the problems described above. However, it brings some new downsides:

Processes in the store hierarchy will not all be Process instances anymore. Some will be ParallelProcess instances. We don't want to make ParallelProcess inherit from Process because it doesn't know enough to do things like generate composites, and adding those missing capabilities is overkill.
This would change the behavior of public functions. I think this is really more of a bug fix than a breaking change since the way we currently store Process instances in the store hierarchy is very counter-intuitive. The biggest breaking change is probably removing Engine.parallel, but I doubt anyone relies on that.

Improve / Replace isinstance() checks

The library has 130 isinstance() checks! isinstance() should be used sparingly since it makes for fragile code.

Vivarium API Bugs & Enhancements

This issue will keep track of different bugs and enhancements in the comments as I test the Vivarium API by trying to create a set of swimming "proto-cell" agents on a lattice.

Update Documentation

Issues to fix:

Pages to audit:

Tutorials
Reference Material

initial_state for generators

We currently have an initial_state in the Generator class, which is inherited by Process. This needs to be overwritten for all use cases or else it gives an exception. But initial_state should work differently for composite generators than it does for process generators. The composite generators can merge the initial states of its constituent processes' initial_state, perhaps with a config that resolves any conflicts between the processes' declared states.

To achieve this I propose: 1) Process has its own initial_state with the same exception approach currently used in Generator. 2) Generator has a different initial_state that calls a new function -- process_initial_states. This new function uses an inverse topology to get the store names corresponding to each processes' ports, calls the processes' initial_state and does a merge. 3) The Generator initial_state can be overwritten for different cases, and the process_initial_states accepts a config that allows it to resolve conflicts between merged states.

Separate Steps and Processes and Add Wirable Interface

Now that we are replacing Derivers with Steps, it's becoming increasingly clear that Steps and Processes are really quite different. It might be a good idea to reflect that distinction in our class hierarchy. We would have two top-level interfaces (abstract classes), Wirable and Composer. Process and Step would both implement each of these interfaces. Note that these interfaces would need to not have overlapping methods since that confuses Python's multiple inheritance.

As part of this shift, we should separate the code needed for wiring things together and put that with the Wirable interface. This will prepare our code to use the idea of wiring things together for purposes besides running simulations, e.g. running workflows or analyses

plot_topology additional features

#64 greatly expanded plot_topology to include several graph_formats (hierarchy, vertical, horizontal), manual coordinates, remove_nodes, and node_labels. There are several still-unimplemented features that would even further expand the options for this plot function.

Additional features that might be of interest include:

collapse_nodes to reduce multiple nodes into one, while keeping their connections.
remove_labels --- selectively remove labels. Currently only one node can have an empty string (''), since this serves as a unique identifier. It would be useful if multiple nodes could be left without labels.

Process Serialization with Inheritance

Currently, Process.__init__() saves off a copy of the parameters to self._original_parameters. Then when serializing, Process.__getstate__() returns self._original_parameters and Process.__setstate__() calls self.__init__() with the parameters.

This is problematic in class hierarchies where constructors change parameters. For example:

class A(Process):

    def __init__(self, parameters):
        super().__init__({'2': parameters['1']})

a = A({'1': True})
a2 = pickle.loads(pickle.dumps(a))

Here, A receives a parameter under the key 1 but passes that parameter to Process under key 2.

Here's how the serialization/deserialization flow looks:

When an instance of A is serialized, Process.__getstate__() returns {'2': True}.
Then when we want to deserialize and get a2, Process.__setstate__() calls self.__init__(parameters). However, it is A.__init__() that runs here, not Process.__init__(), so A.__init__() will be looking for a nonexistent key 1 in the parameters.

The fact that A.__init__() runs is correct. There could be code there that we need to run to initialize instance variables. The problem is that a.__getstate__() should be returning the original parameters passed to A.__init__(), not those passed to Process.__init__().

Here are some possible solutions:

Add an optional parameter to Process.__init__() for the original subclass parameters, which will then be saved in self._original_parameters. This route will require that we update all constructors in an inheritance hierarchy to handle the new parameter.
Add a new Process.save_parameters() method that subclasses can call to save off their original parameters. With this approach, we will only have to change Process and the class at the bottom of the hierarchy. However, we will have to handle multiple calls to Process.save_parameters() since we only want to save the parameters from the child-most subclass.
Add a new special key to the parameters dictionary for the original parameters. If provided, Process.__init__() will save this in self._original_parameters instead of parameters.

I prefer approach (3) since I think it's most in line with how we pass other special configs to Process.

Remove MultiInvoke?

We currently define a MultiInvoke class in experiment.py:

vivarium-core/vivarium/core/experiment.py

Line 124 in a92f8eb

class MultiInvoke:

However, I can't find any usages of it in vivarium-collective or in my wcEcoli work. It seems that we're handling parallel processing using ParallelProcess instead.

Can MultiInvoke be removed?

Specify Versions in setup.py

Right now we don't specify versions in setup.py. We should add them to help pip solve environments when users install vivarium-core. For example, we could specify numpy>=0.1 to say that numpy must be at least version 0.1 to work with vivarium-core

Code Health

Note: This depends on #22.

Code Style

Get all code to pass lint checks
- core/composition.py
- core/control.py
- core/emitter.py
- core/experiment.py
- core/store.py
- core/process.py
- core/registry.py
Complete docstrings
- core/composition.py
- core/control.py
- core/emitter.py
- core/experiment.py
- core/store.py
- core/process.py
- core/registry.py

How to Use Pylint

To run pylint, just execute pylint vivarium.

The files left to fix are listed in .pylintrc after ignore=. We use an exclusion list to ensure that once we fix a file, any code added to it also passes the linter. This prevents regressions. It also ensures that any new files added have proper style.

I use pylint's default rules. You can adjust this in .pylintrc after the disable= line. You can find the list of pylint checks here.

Fixing a File

Delete the file from the list in .pylintrc
Run pylint by executing pylint vivarium
Fix all the errors that are shown
Repeat steps 2-3 until pylint finds no errors

Test Coverage

Reach 100% code coverage
- core/composition.py
- core/control.py
- core/emitter.py
- core/experiment.py
- core/store.py
- core/process.py
- core/registry.py

How to Calculate Coverage

I use the pytest-cov plugin to calculate test coverage. Whenever you run the tests, add the --cov=vivarium argument to calculate coverage. At the end of the test output, you'll get a table like this:

Name                                                  Stmts   Miss  Cover   Missing
-----------------------------------------------------------------------------------
vivarium/__init__.py                                     37      0   100%
vivarium/composites/__init__.py                           0      0   100%
vivarium/composites/injected_glc_phosphorylation.py      13      0   100%
vivarium/core/__init__.py                                 0      0   100%
vivarium/core/composition.py                            403    180    55%   62, 64-68, 141, 143, 156-157, 179-193, 200-221, 247-333, 347, 364-370, 375-386, 417, 445-456, 527-540, 543-544, 548-552, 559-569, 578-580, 614-643, 686-706, 746-755, 998-1000
vivarium/core/control.py                                 85     13    85%   40, 42, 59-60, 63-64, 89-90, 104-106, 112, 119, 211
vivarium/core/emitter.py                                156     80    49%   32-33, 55, 59, 65, 70-73, 86-94, 103, 116, 119, 128, 176-189, 192-196, 199, 203-212, 216-222, 226-227, 232-280, 284-298, 304-312, 320-324
vivarium/core/experiment.py                            1018    104    90%   93, 102, 171, 173, 182-184, 191, 245, 250-251, 262, 267, 271-272, 285-286, 288, 293, 311, 327-328, 381-384, 416, 437, 442, 447, 452, 460-467, 481, 485, 487, 490, 500-502, 520-523, 665-672, 827-831, 855-872, 909, 920, 922, 948-949, 990-997, 1014, 1069, 1076, 1096, 1348-1352, 1461-1462, 1494-1495, 1517, 1546, 1548, 1570-1576, 1677, 2254
vivarium/core/process.py                                277     40    86%   30, 41, 43, 51, 78-81, 100-104, 211, 227, 274, 277-279, 319, 337, 353-361, 370, 386, 392-394, 401, 457, 459, 507, 516, 564
vivarium/core/registry.py                               118     57    52%   77-78, 115-122, 146-153, 164, 182-197, 202-204, 212, 227, 238, 246, 249, 253, 256, 260-264, 271-275, 284-286, 289, 294-296, 300
vivarium/experiments/__init__.py                          0      0   100%
vivarium/experiments/glucose_phosphorylation.py          48      5    90%   91-96, 105
vivarium/library/__init__.py                              0      0   100%
vivarium/library/datum.py                                28     19    32%   2-3, 6-7, 28-41, 44, 47, 50
vivarium/library/dict_utils.py                          163     75    54%   17-20, 33-36, 38, 53, 68, 70, 74, 94, 113-117, 127-132, 136-144, 150-152, 156-175, 183-202, 206, 230-236, 259-261
vivarium/library/fasta.py                                10      8    20%   4-12
vivarium/library/filepath.py                             35     22    37%   22-25, 31-34, 51-58, 73-77, 81-82, 87-89, 93-94
vivarium/library/make_network.py                         87     79     9%   15-27, 33-40, 55-92, 96-125, 136-171
vivarium/library/path.py                                  4      2    50%   4-5
vivarium/library/pretty.py                               24      5    79%   16, 18, 20-21, 26
vivarium/library/schema.py                               39     28    28%   4, 7, 12, 18-25, 28-40, 45-52, 55, 62, 69, 77
vivarium/library/timeseries.py                           32     27    16%   27-50, 61-69
vivarium/library/topology.py                            148     17    89%   36-37, 58-61, 93-98, 107-116
vivarium/library/units.py                                38      6    84%   62-66, 90
vivarium/plots/__init__.py                                0      0   100%
vivarium/plots/agents_multigen.py                       135    127     6%   13-24, 64-226
vivarium/plots/simulation_output.py                     103     16    84%   70, 76, 78-79, 82, 84, 99-103, 122, 129, 132, 138, 172
vivarium/plots/topology.py                              105     17    84%   40, 137-140, 151, 166-180, 187, 202, 255, 268-277
vivarium/processes/__init__.py                            0      0   100%
vivarium/processes/agent_names.py                        14      6    57%   12-14, 17, 29-30
vivarium/processes/burst.py                              58      6    90%   208-212, 217
vivarium/processes/derive_concentrations.py              28     19    32%   20-31, 34, 53-69
vivarium/processes/derive_counts.py                      24      8    67%   32, 35, 55-62
vivarium/processes/divide_condition.py                   15      7    53%   10-11, 14, 17, 25-28
vivarium/processes/engulf.py                             58      6    90%   183-187, 191
vivarium/processes/exchange_a.py                         15      0   100%
vivarium/processes/glucose_phosphorylation.py            39      5    87%   129-139
vivarium/processes/injector.py                           37      9    76%   57, 103-109, 113
vivarium/processes/meta_division.py                      76      9    88%   29, 52, 65, 207-211, 215
vivarium/processes/nonspatial_environment.py             27     15    44%   22-24, 27-72, 75-87
vivarium/processes/remove.py                             47      6    87%   116-120, 124
vivarium/processes/swap_processes.py                     63      6    90%   175-179, 183
vivarium/processes/template_process.py                   38      7    82%   138-146, 151
vivarium/processes/timeline.py                           57      5    91%   43-44, 51, 56-57
vivarium/processes/tree_mass.py                          50      7    86%   69, 168-172, 176
-----------------------------------------------------------------------------------
TOTAL                                                  3752   1048    72%

Required test coverage of 72.0% reached. Total coverage: 72.07%

This lists the fraction of lines in each file that are covered by the tests. It also tells you which lines in the file weren't covered so you can add tests for them.

Notice the last line says we require a code coverage of 72%. This is just the current coverage value. If coverage dips below 72%, the tests will fail to make sure we don't regress.

How to Improve Coverage

Run the tests with pytest --cov=vivarium.
Pick a file that is not completely covered (coverage < 100%) using the coverage output.
Match the lines listed as not being covered to the source file. These are the lines you need to write tests for.
Add tests that will check that the uncovered lines are working correctly.
Run the tests again. You should get a coverage value of 100% for that file.
Update the minimum coverage in .coveragerc to your current coverage value for the entire project.

Static Typing

Annotate all code with types and resolve all issues raised by mypy
- core/composition.py
- core/control.py
- core/emitter.py
- core/experiment.py
- core/process.py
- core/registry.py

How to Use Mypy

We use type annotations as specified by PEP 484 and prefer annotations to special comments like # type: str. See the documentation for typing for more information.

Mypy is configured using the mypy.ini file, which looks like this:

[mypy]

disallow_untyped_defs = True
disallow_incomplete_defs = True

# Ignore missing types from third-party libraries.
[mypy-numpy.*]
ignore_missing_imports = True
...

# Ignore missing types from Vivarium Core files we haven't typed yet.
[mypy-vivarium.core.registry.*]
disallow_untyped_defs = False
disallow_incomplete_defs = False

[mypy-vivarium.core.control.*]
disallow_untyped_defs = False
disallow_incomplete_defs = False
...

Notice how we first set disallow_untyped_defs and disallow_incomplete_defs to True. This means that by default, mypy will raise an error whenever a function is not completely typed. Then, for every file that hasn't been typed yet, we include a section like [mypy-vivarium.core.registry.*] for vivarium/core/registry.py that sets those options to False. As we add type hints, we can remove files from this list until everything has type annotations.

Also note that we tell mypy to ignore types in our dependencies like numpy. Many third-party libraries are poorly-typed, and we want to avoid spending time trying to get mypy to work with them. We will focus on making sure the types in our code are correct instead.

How to Add Types

Pick a file listed in mypy.ini as not being covered yet.
Delete the file's section from mypy.ini.
Run mypy by executing mypy vivarim
Fix any errors raised by mypy
Repeat 3-4 until no errors are raised

Post-v1.0.0 Documentation Updates

Verify that the notebook tutorials still work
Document Composer and MetaComposer better
Document the available arguments for the Engine constructor
In processes.rst, document advanced updater techniques (_add, _delete, _move, {'_updater': ...})
Document analyses
Document Engine.run_for()

Bug with "mixed relative ports schemas"

The following bug was encountered while trying to wire together FBA and bioscrape in this notebook..

Overview: The goal is to have a FluxDeriver connect to both a Bioscrape rate and a Cobra flux_bound.

The following absolute topology worked:

topology = {
            'bioscrape': {
                # all species go to a species store on the base level,
                # except Biomass, which goes to the 'globals' store, with variable 'biomass'
                'species': {
                    '_path': ('species',),
                    'Biomass': ('..', 'globals', 'biomass'),
                },
      
                'delta_species': ('delta_species',),
                'rates': ('rates',),
                'globals': ('globals',),
            },
            
            'cobra': {
                'internal_counts': ('internal_counts',),
                'external': ('external',),
                'exchanges': ('exchanges',),
                'reactions': ('reactions',),
                'flux_bounds': ('flux_bounds',),
                'global': ('globals',),
            },
            
            'flux_deriver': {
            
                'deltas': ('delta_species',),
                'amounts':('globals',),
                
                # connect Bioscrape deltas 'Lactose_consumed' and 'Glucose_internal'
                # to COBRA flux bounds 'EX_lac__D_e' and 'EX_glc__D_e'
                # also connect biomass flux to the dilution rate
                
                'fluxes':
                    {
                        #'_path': ('flux_bounds',),
                        'Lactose_consumed': ('flux_bounds','EX_lac__D_e',),
                        'Glucose_internal': ('flux_bounds','EX_glc__D_e',),
                        'biomass':('rates', 'k_dilution__',)
                    }
            },
            
            'mass_deriver': {
                'global': ('globals',),
            },
            
            'volume_deriver': {
                'global': ('globals',),
            },
            
            'biomass_adaptor': {
                'input': ('globals',),
                'output': ('globals',),
            }
        }

The following relative topology did not correctly update ('fluxes', 'biomass',) --> ('..', 'rate', 'k_dilution__',). Note that the updates from the FluxDeriver seemed fine and that the other flux ports were updated correctly.

topology = {
            'bioscrape': {
                # all species go to a species store on the base level,
                # except Biomass, which goes to the 'globals' store, with variable 'biomass'
                'species': {
                    '_path': ('species',),
                    'Biomass': ('..', 'globals', 'biomass'),
                },

                'delta_species': ('delta_species',),
                'rates': ('rates',),
                'globals': ('globals',),
            },

            'cobra': {
                'internal_counts': ('internal_counts',),
                'external': ('external',),
                'exchanges': ('exchanges',),
                'reactions': ('reactions',),
                'flux_bounds': ('flux_bounds',),
                'global': ('globals',),
            },

            'flux_deriver': {

                'deltas': ('delta_species',),
                'amounts': ('globals',),
                # connect Bioscrape deltas 'Lactose_consumed' and 'Glucose_internal'
                # to COBRA flux bounds 'EX_lac__D_e' and 'EX_glc__D_e'
                # also connect biomass flux to the dilution rate

                'fluxes':
                    {
                        '_path': ('flux_bounds',),
                        'Lactose_consumed': ('EX_lac__D_e',),
                        'Glucose_internal': ('EX_glc__D_e',),
                        'biomass':('..', 'rates', 'k_dilution__',)
                    }
            },
            'mass_deriver': {
                'global': ('globals',),
            },
            'volume_deriver': {
                'global': ('globals',),
            },
            'biomass_adaptor': {
                'input': ('globals',),
                'output': ('globals',),
            }
        }

This problem was fixed by pointing the bioscrape ('rate', 'k_dilution__') at the 'flux_bounds' store while still using a relative topology (previously it was effectively wired the other way around - one would expect both the above and the below to work). This suggests that the there is an issue when using relative topologies that point to very different stores. The working relative topology is given below:

topology = {
            'bioscrape': {
                # all species go to a species store on the base level,
                # except Biomass, which goes to the 'globals' store, with variable 'biomass'
                'species': {
                    '_path': ('species',),
                    'Biomass': ('..', 'globals', 'biomass'),
                },
                'delta_species': ('delta_species',),
                'rates': {
                    '_path' : ('rates',),
                    'k_dilution__': ('..', 'flux_bounds', 'k_dilution__'),
                },
                'globals': ('globals',),
            },

            'cobra': {
                'internal_counts': ('internal_counts',),
                'external': ('external',),
                'exchanges': ('exchanges',),
                'reactions': ('reactions',),
                'flux_bounds': ('flux_bounds',),
                'global': ('globals',),
            },

            'flux_deriver': {

                'deltas': ('delta_species',),
                'amounts': ('globals',),

                # connect Bioscrape deltas 'Lactose_consumed' and 'Glucose_internal'
                # to COBRA flux bounds 'EX_lac__D_e' and 'EX_glc__D_e'
                # also connect biomass flux to the dilution rate

                'fluxes':
                    {

                        '_path': ('flux_bounds',),
                        'Lactose_consumed': ('EX_lac__D_e',),
                        'Glucose_internal': ('EX_glc__D_e',),
                        'biomass':('k_dilution__',)
                        #'Lactose_consumed': ('flux_bounds','EX_lac__D_e',),
                        #'Glucose_internal': ('flux_bounds','EX_glc__D_e',),
                        #'biomass':('rates', 'k_dilution__',)

                    }
            },

            'mass_deriver': {
                'global': ('globals',),
            },

            'volume_deriver': {
                'global': ('globals',),
            },

            'biomass_adaptor': {
                'input': ('globals',),
                'output': ('globals',),
            }
        }
        ```

Permissions and Responsibilities Structure

GitHub offers the following permission levels for organization repositories (docs):

Read: Recommended for non-code contributors who want to view or discuss your project

Triage: Recommended for contributors who need to proactively manage issues and pull requests without write access

Write: Recommended for contributors who actively push to your project

Maintain: Recommended for project managers who need to manage the repository without access to sensitive or destructive actions

Admin: Recommended for people who need full access to the project, including sensitive and destructive actions like managing security or deleting a repository

I propose the following structure of roles for Vivarium and Vivarium Core (I'm ignoring the other Vivarium repos for now):

Maintainers: These are the core "owners" of the Vivarium project. They have the owner role in the Vivarium Collective organization and the admin role in Vivarium Core. They are responsible for enforcing the code of conduct, responding to security vulnerabilities, and adjudicating disputes.
Collaborators: These are trusted contributors who help develop Vivarium projects. They have the member role in the Vivarium Collective organization and the write role in Vivarium Core. They are empowered to manage issues, open PRs, and otherwise develop Vivarium Core.
Public: Anyone can read the Vivarium Core repo, and anyone can contribute by forking the repository and opening a PR. Anyone can also open or comment on a discussion thread or issue.

Remaining Questions:

Should we require PRs to be approved by certain people? This would probably mean maintainers approve PRs, but collaborators can merge them.
The Apache CLA is based around the idea of an entity to whom contributors grant a license to use their contributions. Who should this entity be? Maybe the maintainers?

inter-repository automated testing

When vivarium-core updates, it has the potential to break many other vivarium projects. Because of this, it would be useful to implement automated testing that can check other repos before making changes to vivarium-core. Here is a link to some useful discussion about how this could be done: https://blog.marcnuri.com/triggering-github-actions-across-different-repositories/

External Controls for Dealing with Conflicting Updaters

Consider the case when there are two processes A and B which by default each update a store X by setting.

Currently, the order A and B are added to the Bigraph determines which setter takes precedent. It would be very helpful to have a flag in the schema {path: {X : _update : False} which can be used to override the default schema to turn off updating from one of the processes or the other to remove this order dependence.

Interrupts for Engine

Currently Engine.front tracks all processes next updates, with a dict that has the next update times and the updates that will be applied: {'time': 14.0, 'update': {update-to-be-applied}}. The process's next update time is determined by the global time plus the timestep given by its Process.calculate_timestep().

Having interrupts would allow other processes to change a given process's next update time to be earlier. This might be necessary if, for example, an event such as chromosome replication needs to reset transcription at a given time. If the interrupt time is before the previously-determined next update time, the transcription process would need to be rerun from its previous initial state, and would return a new update dict: {'time': 13.0, 'update': {new}}.

To add this feature requires that we save the initial state provided for the process, so it could be re-run from that state at a shorter interval. Something like: {'time': 14.0, 'update': {update-to-be-applied}, 'state': {saved-view}}. It would also require the interrupt times to be received and handled by Engine.run_for.

Possible Bug in Store.check_default

We have this code in Store.check_default():

vivarium-core/vivarium/core/store.py

Lines 132 to 135 in a92f8eb

 if isinstance(self_default_comp, np.ndarray): 

 self_default_comp = self.default.tolist() 

 if isinstance(new_default_comp, np.ndarray): 

 new_default_comp = self.default.tolist()

Notice that we set new_default_comp to self.default.tolist() even when it is new_default that is a numpy array. This seems like a bug.

On a related note, we warn about a schema conflict when the new and existing defaults are the same. Shouldn't we warn about that when the defaults differ? Here's the whole function (link):

    def check_default(self, new_default):
        defaults_equal = False
        if self.default is not None:
            self_default_comp = self.default
            new_default_comp = new_default
            if isinstance(self_default_comp, np.ndarray):
                self_default_comp = self.default.tolist()
            if isinstance(new_default_comp, np.ndarray):
                new_default_comp = self.default.tolist()
            defaults_equal = self_default_comp == new_default_comp
        if defaults_equal:
            if (
                not isinstance(new_default, np.ndarray)
                and not isinstance(self.default, np.ndarray)
                and new_default == 0
                and self.default != 0
            ):
                log.debug(
                    '_default schema conflict: %s and %s. selecting %s',
                    str(self.default), str(new_default), str(self.default))
                return self.default
            log.debug(
                '_default schema conflict: %s and %s. selecting %s',
                str(self.default), str(new_default), str(new_default))
        return new_default

remove self-register of processes

Some processes self-register by invoking upon import, for example derive_concentrations and derive_counts. Now that we have the __init__ setting up registration, it would be better to just import these processes, import the registry, and manually register them. This is cleaner than self-registration and sets up a nice pattern for process registration.

Composite-Composer-MetaComposer

Some discussed changes to the vivarium class structure include:

Making a Composite object, which has processes and topology attributes. This would be a short-lived object, as its processes and topology would be passed into an experiment for execution. An advantage would be improved application of developer tools, and type checking. Also, it could have its own initial_state method and merge method.
With the above change, Composer would return Composites.
Experiment could also accept Composite, and call its processes and topology attributes.
A MetaComposer would be a composer of Composers. In its generate_processes and generate_topology it would call its component Composers and merge their generated Composites.

More flexible topology rewiring for glob schemas

Consider a process (e.g. multibody physics) with a ports schema like this:

{
    'agents': {
        '*': {
            'a': {...},
            'b': {...},
        },
    },
}

Now imagine that we want to use this process with an agent with a store hierarchy like this:

{
    'a': {...}
    'boundary': {
        'b': {...},
    },
}

We don't currently have a way to wire the environment process to this agent.

Improve Handling of Process Parameters and Defaults

Status Quo

Current Approach

Currently, processes have a defaults class variable that specifies the parameters they expect. Then when calling a process, you can provide parameters in a dictionary to the parameters argument to override those defaults. For example:

class MyProcess(Process):

    defaults = {
        'flag': True,
    }

proc = MyProcess({'flag': False})

Problems with Current Approach

Our current approach for handling process parameters and defaults has a number of issues:

Parameters do not have types, which prevents type checking from catching typing errors. For example, you could write MyProcess({'flag': 'test'}) without raising any type errors.
Similarly, we don't actually enforce the convention that all accepted parameters be present in defaults. For example, MyProcess could have a next_update that uses self.parameters['flag2'] even though MyProcess.defaults does not contain flag2.
Other Python tools don't understand our way of specifying parameters. This means that e.g. PyCharm won't suggest parameters when you initialize a process.
Our approach is not the standard Python way of specifying parameters, which is confusing.
defaults are not handled correctly when subclassing processes. For example, consider the process class MyProcess2(MyProcess) with a constructor that calls MyProcess.__init__(parameters). Inside MyProcess.__init__, self.defaults refers to MyProcess2.defaults, not MyProcess.defaults like you'd expect.

Functionality to Preserve

Backwards compatibility. We don't want to have to rewrite all our libraries.
Ability to save off parameters for serialization.

Possible Alternative

All the problems would be solved by using normal Python function arguments for process parameters. Here's how we could preserve the desired functionality:

Not sure how we can do this yet.
I think we can use inspect and locals() to do this. We would, in Process.__init__(), first use self.__class__ to get the subclass. Then we could use code like this to get the parameters passed to the subclass constructor:
```
>>> def foobar(foo, bar, baz):
...     sig, foobar_locals = inspect.signature(foobar), locals()
...     return [foobar_locals[param.name] for param in sig.parameters.values()]
...
>>> foobar(1, 2, 3)
[1, 2, 3]
```
Source: https://stackoverflow.com/a/10724602

Note that the locals() call would have to happen in the subclass, and the result would need to be passed to the superclass constructor.

Stop adding units to emitted data keys

Currently, when we emit data that has a unit in the state hierarchy and format it as a timeseries, we put the unit into the key and strip the units from the value. For example, we might get:

{
    'external': {
        ('antibiotic', 'millimolar'): [1, 2, 3],
    }
}

This is not ideal. Instead, our plotting functions should know how to handle Quantity objects.

Here's where we do the unit conversion:

vivarium-core/vivarium/library/dict_utils.py

Lines 264 to 268 in 6836411

 if isinstance(value, Quantity): 

 unit_key = (key, str(value.units)) 

 if unit_key not in timeseries: 

 timeseries[unit_key] = [] 

 timeseries[unit_key].append(value.magnitude)

And here's a trace of the calls in emitter.py that lead to the code above:

vivarium.library.dict_utils.value_in_embedded_dict()
vivarium.core.emitter.timeseries_from_data()
vivarium.core.emitter.Emitter.get_timeseries()

Note that addressing this will probably be a breaking API change since doing so will change our output format.

precision topology_view caching

If a '_generate' or '_divide' update is triggered it requires the expiration and re-caching of some topology_views. Currently, Store.apply_update will pass a view_expire flag (here) back to Engine, which triggers self.state.build_topology_views(): (here). This expires and re-caches ALL of the topology_views across the entire Store hierarchy. It would be more efficient if it knew to only expire and re-cache the topology_view of the subset of processes that are connected to the affected stores.

	if isinstance(self_default_comp, np.ndarray):
	self_default_comp = self.default.tolist()
	if isinstance(new_default_comp, np.ndarray):
	new_default_comp = self.default.tolist()

	if isinstance(value, Quantity):
	unit_key = (key, str(value.units))
	if unit_key not in timeseries:
	timeseries[unit_key] = []
	timeseries[unit_key].append(value.magnitude)

vivarium-collective / vivarium-core Goto Github PK

vivarium-core's People

Contributors

Stargazers

Watchers

Forkers

vivarium-core's Issues

Git Workflow

Versioning and PyPI Workflow

Questions to Answer

Current Approach

Message Passing

Tracking in Main OS Process

Problems

Proposed Solution

Code Style

How to Use Pylint

Fixing a File

Test Coverage

How to Calculate Coverage

How to Improve Coverage

Static Typing

How to Use Mypy

How to Add Types

Status Quo

Current Approach

Problems with Current Approach

Functionality to Preserve

Possible Alternative

Recommend Projects

Recommend Topics

Recommend Org