kassonlab / run_brer Goto Github PK

View Code? Open in Web Editor NEW

0.0 0.0 3.0 11.62 MB

Python package for running bias-resampling ensemble refinement (BRER) simulations

License: GNU Lesser General Public License v2.1

Shell 7.36% Python 92.64%

run_brer's People

Contributors

Watchers

Forkers

eirrgang carolinedav scale-ms

run_brer's Issues

Need more flexibility in resource assignment.

run_brer does not provide a way to pass information about allocated resources to the simulator. We need to make sure that GROMACS is not expected to auto-detect the number of threads available in all cases.

This may or may not be related to the common observation of messages related to "Non-default thread affinity" and subsequent failure to pin threads to CPU cores (a potentially significant performance issue).

Need a better failure mode for run_config.py when Context instance is missing `potentials` attribute.

The potentials attribute on the Context instance is a lingering wart from an incomplete API for retrieving data from pluggable MD extensions after simulations have run. (See https://gitlab.com/gromacs/gromacs/-/issues/3145)

The attribute is not created (by the _md builder) until the Session is launched.

Launch errors should prevent

run_brer/run_brer/run_config.py

Line 470 in 1ed6857

 start_times = [potential.start_time for potential in context.potentials if hasattr( 

from being reached.

However, the attribute may be missing if the run_brer code is being evaluated on a rank that does not participate in the simulation.

If there are no pluggable potentials attached, there should still be an empty list, but we should probably check for that, too.

Can anyone else think of any other reasons why context.potentials would be missing or invalid?

Improve support for ensemble management

Right now, a separate config file is needed for each ensemble member. We can probably make that more user-friendly. We need to consider, though, how that would might interact with restartability for jobs that require multiple submissions.

Needs updates for gmxapi 0.1+

The gmxapi interfaces for GROMACS 2020 and higher (gmxapi>=0.1) are not trivially compatible with gmxapi 0.0.7 and GROMACS 2019. However, it shouldn't be that hard to write some adapters and compatibility logic for a simple port.

Full migration to updated features can be treated in follow-up issues.

Note that the brer GROMACS MD extension (plugin) likely requires some updates, too, but the plugin interface has actually changed very little, so far.

Checkpoint transfer from convergence to production may not be sufficiently robust

@carolinedav reported that production phase failed when simulations tried to start with zero-sized checkpoint files.

I'm not sure how this happened, but there are a few ways it could conceivably happen. We should figure out how to reproduce and examine log output (such as from line

run_brer/run_brer/run_config.py

Line 135 in ae50d68

 self._logger.info("Phase is {} and state.cpt already exists: not moving any files".format(phase)) 

)

One set of issues could be related to the non-atomicity of file copies. Even leaving aside possible race conditions for the moment, there do not seem to be provisions for recovering from a partial copy of the checkpoint file after a failure at

run_brer/run_brer/run_config.py

Line 153 in ae50d68

shutil.copy(gmx_cpt, '{}/state.cpt'.format(os.getcwd()))

in subsequent runs.

One way to make this section of code more robust would be to copy the checkpoint file to a temporary file in the target directory and then move the file (an atomic operation in all relevant environments) to its expected name only after the copy is successful. There are other measures we could take, but this is by far the simplest, and is a reasonable step to take regardless of anything else.

Incorporate `brer_plugin` into `run_brer`.

I am strongly considering merging the brer_plugin repository into the run_brer repository. I don't see any particularly good reason to keep them separate, and it can be a pain to maintain both. Also, then brer_plugin would automatically be available on PyPI as part of run_brer. We might choose to rename the package to, simply, brer, (or perhaps brer-md?)

The main potential downside is that the brer_plugin is trickier to install than run_brer because of its C++ dependencies on GROMACS libraries. However,

run_brer already depends on the gmxapi Python package, which similarly depends on a GROMACS installation.
We don't currently have any use cases for run_brer that don't necessitate a brer_plugin installation, anyway. Overall, a single installation is arguably simpler and can benefit from environment inspection tools in gmxapi. (Again, the gmxapi package is a dependency of run_brer, but brer_plugin currently has only a primitive CMake level dependency on GROMACS libraries, for simplicity. But having gmxapi available when doing the brer_plugin install would be handy.)

Then again, we might have use cases that don't involve brer_plugin if we incorporate more analysis features into run_brer. This, in turn, could be mitigated by putting the analysis features into a separate package that is even lighter-weight than the current run_brer. Maybe brer-tools? This is a somewhat hypothetical point, though, and doesn't necessarily need to be resolved at the same time.

Notes on legacy version support

I believe that run_brer, brer_plugin, and the brer-md works in progress all support GROMACS 2019 and gmxapi 0.0.7 through GROMACS 2023 and gmxapi 0.0.4. We need to make sure this is tested, though, and if there are installation caveats, those need to be documented.

We also need to revitalize our Singularity (and Docker?) recipes.

For users of gmxapi 0.0.7, it may be appropriate to provide a 0.0.7 sdist on https://pypi.org/project/gmxapi/.

In any case, we need to make sure the instructions are clear for GROMACS versions back through 2019 and gmxapi versions back through 0.0.7.

Example case

At one external site, the sysadmins make BRER available to users through Singularity with the following scriptlet in their image builder.

git clone --depth=1 -b release-0_0_7 https://github.com/kassonlab/gmxapi.git
cd gmxapi
mkdir build; cd build
cmake ../ -Dgmxapi_DIR=/usr/local/gromacs/share/cmake/gmxapi -DGMXAPI_INSTALL_PATH=/usr/local/lib/python3.6/dist-packages/gmx
make -j8 install

Avoiding version mismatch problems

Modern Python package build front-ends use an isolated build environment by default. This isolated environment may or may not have access to the parent environment's system-site-packages. This could cause confusion for users who intend to use an older version of gmxapi that is not binary compatible with the latest version.

It may be appropriate to advise people who want less-than-the-latest versions of dependencies to install dependencies first and use --no-build-isolation and/or --no-deps when installing brer-md. Otherwise, it is conceivable that someone may end up with the brer.md C++ extension built for the latest gmxapi in the automatically-generated ephemeral build environment, but then the package may be installed in an environment with older, possibly binary-incompatible packages. (We are trying to keep gmxapi ABI compatible for 0.0.3.1+, but previous versions are definitely not binary compatible.)

To reduce surprising errors, we should probably customize the metadata that is generated for the built package (the wheel) to require exactly the gmxapi version for which it was built. We are only distributing sdist packages on pypi.org, since the wheel depends on too many details about the local gromacs installation. Since the wheel will not be portable outside of its intended installation environment, it can have stronger site-specific dependencies that would not be suitable for an archive on a public package index. We will need to document what users will encounter if they attempt to install the locally built wheel in an environment with a different gmxapi installation, and what their options are.

We also need to consider (and document) what the user will experience and how to upgrade the brer-md dependent when the gmxapi (dependency) package is upgraded to an ABI incompatible version. (Our advice still can and should be "Use a fresh venv for each research project and keep a consistent tool kit in the interest of scientific reproducibility," but we don't have to be pushy.)

Training phase may not be robust to simulation failure.

It may be that failed simulations during the training phase do not prevent the run_brer state from advancing.

This would allow the convergence phase to begin with alpha values that aren't trained, such as the initial value of zero. (This is a pathological case that has appeared in practice, but we aren't completely sure of the cause yet.)

Update Singularity image

If necessary, improve docs for people who want to update it themselves.

Automate documentation builds

Currently, the contents of the docs directory are meant to be automatically used to update the GitHub Pages content.

This means someone has to run sphinx-build locally and git commit the content.
This bloats the repository and clutters the commit history.

The GitHub Pages builder (jekyl) will not run sphinx-build for us. Instead, we need to configure a GitHub Actions workflow to build and publish the documentation to the gh-pages branch. For example, see https://www.docslikecode.com/articles/github-pages-python-sphinx/ and https://github.com/annegentle/create-demo/blob/main/docs/buildsite.sh

Also:

If we want to use readthedocs.org, we need to update the config, but
I think we should remove the .readthedocs.yml and agree to go all-in on the GitHub Pages site.

We should try to address this within 2022Q2, since we are making some minor interface changes and documentation additions that we should have reflected in our web docs.

Update GitHub Pages settings to use the gh-pages branch instead of master
Trim docs directory in master and manually update gh-pages branch.
Remove .readthedocs.yml
Automate docs build with GitHub Actions.
Update docs build badge

Update coverage

Add test job for GROMACS 2022 and gmxapi 0.3
De-duplicate push and pull_request GitHub workflows.

Establish versioning and release policies

This issue concerns the run_brer project as well as the https://github.com/kassonlab/brer_plugin project.

We did not import a tag for the run_brer release associated with @jmhays 1.0 code when we moved the repository. I think I have a local repository somewhere that can identify the commit to which such a tag should refer, but it is also possible that there was not a tagged release for run_brer at all. If there was, it seems like it would be reasonable to duplicate the tag in the kassonlab/run_brer project. Otherwise, it seems reasonable to assign one.

The changes in #20 aren't really API breaking, but they definitely warrant a version bump from the version at original publication time. I'm not sure what the version bump should be or whether we should tag a new release right away, but it is significant that the updated package looks for the brer_plugin module rather than the myplugin module.

brer_plugin has a v1 branch corresponding to the sample_restraint fork that @jmhays published with. We could tag it for archival purposes, but we should definitely tag a superior release first to avoid confusion. There are few changes to the @jmhays code other than the Python module name. It may be simplest to skip brer 1.0 and just tag the master branch with v2.0.0 as the first release under the brer package name, since the git history of https://github.com/kassonlab/brer_plugin includes the sample_restraint and myplugin commits and since we would prefer not to confuse people by implying that the first tagged release of brer is identically the package described in DOI: 10.1021/acs.jpclett.9b01407

Important notes

The plugin for DOI: 10.1021/acs.jpclett.9b01407 is from a particular branch of a fork of the sample_restraint project (currently available at https://github.com/kassonlab/brer_plugin/tree/v1). The Python package name myplugin was left unchanged from the sample code.
The current kassonlab/run_brer master branch refers to the myplugin module, per the original @jmhays code.
The current kassonlab/run_brer devel branch (slated for merge with #20 ) references the brer module for plugins. The brer module is available at the https://github.com/kassonlab/brer_plugin/ master branch.
I have neglected to compile release notes for the updates since importing these projects to kassonlab.

Questions

Is there a tag name for the sample_restraint commit associated with 10.1021/acs.jpclett.9b01407?
- Answer: https://github.com/jmhays/sample_restraint/releases/tag/v1.0.0-BRER
Is there a tag name for the run_brer package associated with 10.1021/acs.jpclett.9b01407?
- Answer: TBD, but run_brer uses versioneer.py configured to use git tags.
Is it okay to tag the @jmhays versions of the packages as 1.0 and the @kassonlab versions as 2.0?
Do we want to resolve more of the outstanding issues before
- merging #20?
- tagging a new run_brer release?
What are our thoughts on change log / release notes?

Proposals

Tag the @jmhays versions of the packages as 1.0 and the @kassonlab versions as 2.0.
Merge #20 (once brer_plugin has a tagged release for the current master branch, but before tagging run_brer 2.0).
Once
- tags are in place
- automated testing is updated,
  we can
- Apply semantic versioning policy
- Use Versioneer metadata maintenance
- Publish releases to pypi.org

Interrupted simulations cause premature phase advancement

GROMACS simulations exit without error when terminating early due to the run time max_hours parameter or even a job time-out.

run_brer allows the BRER "phase" or "iteration" to advance when the simulation exits without crashing the script. Instead, RunConfig.run() should do some checking before advancing the BRER state.

Convergence phase

For the convergence phase, we use the linearstop_restraint plugin. This code stores and uses a value associated with convergence, but doesn't currently export it.

We can add another accessor (like getTime()) to the plugin and bind a new output property.

Then, the convergence phase can read values from the plugin the same way the training phase does (e.g. alpha and target)

Then, we will use this information before updating the state.

Given that the gmxapi context is returned by the functions that perform each phase of work, we might choose to inspect the plugin output in rc.run() or in a new check... function called by rc.run() rather than in the __convergence function, since it is (at least currently) the responsibility of rc.run() to advance the phase.

Production phase

To check whether we have generated a trajectory of the prescribed length, we can add a time attribute to an additional potential, and compare its value to that obtained at the end of the convergence phase.

Also

Note that, for correctness and future-proofing, we should check whether we are in an MPI context and reduce the values from other plugins, but that might not apply to the current run_brer version, and isn't necessary for initial steps at a solution. (See #18)

Similar to #8

HPC environment optimizations can interact badly with checkpoint verification.

We have encountered some problems with the file existence checks (e.g.

run_brer/run_brer/run_config.py

Line 231 in 532e5ed

if not os.path.exists(src):

) in which newly created files are not detected, preventing correct advancement of BRER simulation phase.

On Frontera, the issue was traced all the way down to failures in the OS-level filesystem stat system call. The problem disappeared when we deactivated the Python module caching tool recommended by the HPC admins.

Though the workaround was successful, we don't have a good way to detect the underlying problem to warn users or to make our internal checking more robust.

We need some combination of documentation and Python logic to handle this situation better.

Update automated testing hooks

Several automated testing hooks should either have their references removed from the README or should be updated to use kassonlab instead of jmhays accounts.

~~Travis-CI~~ now using GitHub Actions
docs build (see also #3, #35)
~~lgtm alerts~~ (moved to kassonlab/brer-md#4)
codecov

GitHub Actions broke

(updated) CMake 3.24 has new behavior when generating its supporting files, with the result that find_package(gmxapi) is now broken if GROMACS was configured with CMake 3.24.

Ref https://gitlab.com/gromacs/gromacs/-/issues/4563

A workaround is to pip install "cmake<3.24" before doing pip install gmxapi. I will be patching GROMACS and uploading new pypi packages as soon as possible.

Support max_hours

It would be nice if we could provide the max_hours kwarg to gmx.from_tpr so that job scripts can convey the job time-out, potentially allowing for cleaner shut-down. This doesn't seem to have been a problem so far, but we can probably just improve the current situation and allow future compatibility by accepting and passing **kwargs or gmx_kwargs: dict from RunConfig through to from_tpr.

Redundant testing data.

The pair_data.json file is redundant with the raw_pair_data test fixture. One should be derived from the other (or removed).

Note that the run_brer/data was removed at some point in the git history of this repo or the sample_restraint because the sample input file could be obtained elsewhere, but we need to continue to include it for the moment in order to support all gmxapi releases.

Test both release and development plugin code.

Currently, we only test the tip of the default branch from https://github.com/kassonlab/brer_plugin

Once we tag brer_plugin 2.0, we will want to test both the tagged release and the development branch.

See .github/workflows/test.yml and ci_scripts/brer_restraint.sh

Become PEP 517/518 compliant

pip install . does not get the right version information installed, but python setup.py install does. I think this is because pip sees the setup.cfg and assumes it is sufficient for installation, so setup.py does not extract the version info through versioneer.

We know how to write a pyproject.toml and what the minimal versioneer setup.py needs to look like, so this should be straightforward to fix.

Towards ensemble workflow management

The BRER ensemble member number handling needs some generalization for forward compatibility with ensemble-ready workflow management. I'll follow up more later, but one point I don't want to forget is that the traditional gmxapi 0.0.7 MPI ensemble executor maps the MPI rank to its notion of ensemble member. The BRER ensemble member is not necessarily always the index of the running simulation, since some BRER ensemble members may not be running in a particular job. (TODO: decouple hard-wired relationships between running task index and ensemble member ID, whereever such things pop up.) Note that this hasn't been a problem so far because we have been using job array IDs on systems where we can specify the list of array elements at job submission time. (This works great, but is far from general.)

ValueError doesn't seem to fit the surrounding logic.

Ref:

run_brer/run_brer/run_data.py

Line 106 in 532e5ed

'You have provided a name; this means you are probably trying '

The error message starts by asserting that the caller has named a specific pair, but the enclosing conditional is not name.

The structure of the conditional logic and the text of the message have remained consistent since the original commit b9b750a

Presumably, there is an error in the wording of the message, but we should examine carefully before updating.

Reconcile index.rst with README.md

I just noticed that these have largely overlapping content, but have diverged with lack of attention to index.rst.

I'm not sure what the best solution is. If we want the content of README to be included in https://kassonlab.github.io/run_brer/, we could try to convert README.md to README.rst and inline it directly into index.rst.

If we want to keep them separate, we should consider whether some content should be deduplicated to reduce maintenance burden. But we might just need to do a better job of keeping both up-to-date.

Thoughts?

Move restraint plugin repository to kassonlab space

Create a https://github.com/kassonlab/brer_plugin repository
Move this issue to the new repository, along with the pending issues below.
Transfer the sample_restraint fork from @jmhays to https://github.com/kassonlab/brer_plugin
Update external links.
Update sample_restraint artifacts to clarify that the repository is for the brer plugin. Update README, project name, docs, tests. Increment the project version to aid with clarity.

Pending issues for renewed repository

Warning or Error when alpha parameter is 0.
Update for gmxapi 0.1+

Update documentation links

Documentation builds are no longer built (or updated) by @jmhays

We should

Establish hosting for documentation through a kassonlab account, and
Update links in the repository to point to the new authoritative documentation URLs.

state.json overwritten with new trial start

state.json gives an overview of the BRER run for a given member. When a member restarts BRER with a new trial (ex. mem_1/0 > mem_1/1), the state file is overwritten with the new trial's information. It would be nice if we could maintain the run state for every trial (mainly for readability).

Merge test_production_bootstrap into test_run_config

The test adds several minutes in order to test an option that requires a run in the production phase.

Prior to resolving issue #19, it was non-trivial to use the max_hours keyword argument in conjunction with multiple trajectory segments in the same production phase.

With resolution of #19, we can perform several short simulations in the production phase, including the bootstrap test, reducing our testing time.

kassonlab / run_brer Goto Github PK

run_brer's People

Contributors

Watchers

Forkers

run_brer's Issues

Notes on legacy version support

Example case

Avoiding version mismatch problems

Important notes

Questions

Proposals

Convergence phase

Production phase

Also

Pending issues for renewed repository

Recommend Projects

Recommend Topics

Recommend Org