wehs7661 / ensemble_md
A Python package for performing GROMACS simulation ensembles
License: MIT License
We should enable different simulation input files, including the GRO and TOP files for different replicas. This is a feature especially required in EEXE simulations for multiple serial mutations because different replicas corresponding to different mutations require different TOP files and have different GRO files at λ = 0. Note that, as mentioned in issue #15, different replicas are bound to different changes in chemistry, in which case we shouldn't need to swap the TOP files, but just the GRO files (done by coordinate manipulation).
Note that this issue is a part of the work in the project EEXE for serial mutations.
In the updated theory, there is only one way to calculate the acceptance ratio assuming symmetric proposal probabilities. Previously we removed the option metropolis_eq/metropolis-eq from the YAML parameter acceptance, and we decided to remove the state-state swapping scheme, which leaves only one option (metropolis) for the parameter acceptance and makes the parameter unnecessary. Therefore, the YAML parameter acceptance should be removed and the documentation should be updated accordingly.
Here we list todos that are not currently urgent but could be important for future work:
msm_analysis.py and write unit tests for its functionalities. Expand its functionalities if necessary.
To expand the usage of EEXE, we want to enable coordinate manipulation at exchanges between replicas, which is most likely to be useful for estimating the free energy of multiple serial mutations using expanded ensemble simulations, such as mutating methane into ethane and then propane.
For example, we can have an EEXE simulation composed of two replicas mutating methane into ethane and ethane into propane, respectively, and only exchange the coordinates between replicas when they are at the end states, i.e., replica 1 being at λ=1 and replica 2 being at λ=0. In this example, we will have the following end states:
At exchanges, we will have two output gro files, one from each of replicas 1 and 2, namely rep1.gro (state b, ethane with a dummy H atom at the first carbon) and rep2.gro (state c, ethane with a dummy methyl group at the second carbon).
Note that in EEXE, each replica is bound to the transformation for its assigned alchemical range. In our case, this means that replica 1 will only be responsible for the mutation of a methane to an ethane, and replica 2 will only be responsible for mutating an ethane to a propane. Normally, we would just swap the gro files as is, so in the next iteration, replica 1 would be initialized with rep2.gro and sample the intermediate states along the mutation path between methane and ethane. However, rep2.gro is an ethane with a dummy methyl group, not the ethane with a dummy H atom that we need for such sampling. The same thing would happen when trying to initialize the next iteration of replica 2 using rep1.gro.
To address this issue, we can modify rep2.gro as follows and use it to proceed to the next iteration of replica 1:
Remove the dummy methyl group from rep2.gro.
Add a dummy H atom to rep2.gro. Specifically, the coordinates of the dummy H atom can just be the coordinates of the second carbon atom. There won't be clashes since the dummy H atoms have no interactions with the rest of the system.
Similarly, we can modify rep1.gro as follows for the next iteration of replica 2:
Remove the dummy H atom from rep1.gro.
Add a dummy methyl group to rep1.gro. Specifically, we can take the internal coordinates of the methyl group in rep2.gro, treat the group as rigid, rotate it, and attach it to the second carbon atom in rep1.gro.
Importantly, we can make the two modified gro files have the same potential energy, so the proposed exchange will always be accepted.
Here, we are not going to implement functions for coordinate manipulation in EEXE but modify the CLI run_EEXE (and the function run_grompp in ensemble_EXE.py, if necessary) to allow the flexibility of calling a user-defined function for coordinate manipulation from an input python module (where the user-defined function is defined).
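One possible way to call a user-defined function from an input python module is to load the module from its file path with importlib. The sketch below is only an illustration under that assumption: the helper name load_user_function and its arguments are hypothetical and are not part of ensemble_md.

```python
import importlib.util
from pathlib import Path


def load_user_function(module_path, func_name):
    """Load the callable named func_name from the Python file at module_path.

    Hypothetical helper: both arguments would come from user input,
    e.g. from the YAML file or from a CLI flag.
    """
    path = Path(module_path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f'Cannot load a module from {path}')

    # Execute the user's file so that its functions become attributes of module
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    func = getattr(module, func_name, None)
    if not callable(func):
        raise AttributeError(f'{func_name} is not a callable defined in {path}')
    return func
```

The returned function could then be called at each exchange with the paths of the gro files to be modified, so the package itself would not need to know anything about the chemistry involved.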
The current theory section in the documentation is outdated and a part of it is wrong. It should be updated ASAP.
Before the next release (presumably 0.5.0), one of the action items is to improve the quality of existing unit tests and cover the untested functions. Here is a checklist for the modules to test. An item is checked only if the existing tests have been re-examined and new tests have been added (when necessary) to increase the code coverage.
ensemble_EXE.py (finished at commit 09adb37)
utils/utils.py (commits d83e5a)
utils/gmx_parser.py
analysis/analyze_matrix.py
analysis/analyze_traj.py
analysis/analyze_free_energy.py
analysis/msm_analysis.py
Notably, there should not be a need to cover the CLIs and utils/exceptions.py. We do need to re-examine the CLIs to make sure they are all sufficiently compartmentalized, so I list them below as well.
run_EEXE.py
analyze_EEXE.py
explore_EEXE.py
Here we list functionalities in the package that will be removed/deprecated soon:
final for the YAML parameter w_combine
The documentation will also be updated accordingly.
Here we list the todos that should be done before the next release of the ensemble_md package. The release number has not been finalized and can be either 0.10.0 or 1.0.0 depending on the status of the paper review. Overall, the todos are to improve the code coverage and the documentation of the package. This issue includes updated components from the plans mentioned in issues #7, #26, and #33, as parts of them have become outdated.
Note that we do not include adding unit tests for analysis/msm_analysis.py since MSM analysis for REXEE is a WIP.
analysis/analyze_free_energy.py
analysis/clustering.py
analysis/synthesize_data.py
test_mpi_func.py, which tests functions that use MPI
replica_exchange_EE.py
Potentially, we might want to consider adding regression tests for the CLIs.
Potentially, we might want to add some simple tutorials before the next release.
In commit c661fb8, we enabled the use of MPI-enabled GROMACS and disabled the use of thread-MPI GROMACS. However, as discussed in issue #10, in the original implementation, the execution of multiple grompp commands was parallelized by mpi4py using the conditional statement if rank < self.n_sim:, while in the new implementation that allows MPI-enabled GROMACS, the GROMACS tpr files are generated serially using the following lines in the function run_grompp:
# Run the GROMACS grompp commands in parallel
for i in range(self.n_sim):
    print(f'Generating the TPR file for replica {i} ...')
    returncode, stdout, stderr = self.run_gmx_cmd(args_list[i])
    if returncode != 0:
        print(f'Error:\n{stderr}')
        sys.exit(returncode)
I tested the code with a fixed-weight EEXE simulation for the anthracene system with nst_sim=1250 and n_sim on Bridges-2 with 64 cores requested, runtime_args={'-ntomp': '1'}, and n_proc=64. As a result, 50 iterations took 327 seconds. This is much longer than the original implementation that used mpi4py to parallelize the GROMACS grompp commands. Specifically, using the new implementation, 20000 iterations would take approximately 327 * 400 seconds, which is around 36 hours, much longer than the 13 hours required to finish the same simulation using the original implementation.
In light of this, we should figure out a way to parallelize the GROMACS grompp commands without introducing any noticeable overhead and without using mpi4py (to prevent nested MPI calls). Also, we should try to identify if there is any other source that contributed to higher computational costs in the new implementation of EEXE.
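One candidate that avoids mpi4py is to launch the grompp commands as ordinary subprocesses from a thread pool, since each thread only waits on an external process. The sketch below is an untested idea in this context, not the approach adopted in the package: run_cmd and run_cmds_concurrently are hypothetical names, and whether this actually removes the overhead on a cluster would need to be benchmarked.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_cmd(args):
    """Run one command and return (returncode, stdout, stderr)."""
    proc = subprocess.run(args, capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr


def run_cmds_concurrently(args_list, max_workers=None):
    """Run all commands concurrently and return the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of args_list, so results[i]
        # corresponds to replica i
        return list(pool.map(run_cmd, args_list))
```

Here args_list would play the same role as in the serial loop, one argument list per replica, and the return codes could be checked after all commands have finished.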
As a part of issue #41, here is a checklist for keeping track of the progress of proofreading docstrings of all modules in ensemble_md.
replica_exchange_EE.py
utils
  exceptions.py
  gmx_parser.py
  utils.py
analysis
  analyze_traj.py
  analyze_matrix.py
  analyze_free_energy.py
  clustering.py
  synthesize_data.py
  msm_analysis.py
cli
  explore_REXEE.py
For unknown reasons, the unit tests for functions that use MPI occasionally fail with uninformative STDOUT/STDERR. Instead of figuring out the underlying reason for these failures, an easier way to ensure that at least the build on the master branch passes the CI may be to simply rerun the CI. However, this would require some modifications in config.yaml. Some instructions can be found here.
The current implementation does not support extension/continuation of REXEE simulations, which should be added shortly.
In free energy calculations for a REXEE simulation using the current implementation, the pickled preprocessed data do not support recalculations with different subsampling fractions, so the calculation can fail when subsampling_avg=True. This is because when subsampling_avg=True is set, variables like t_idx_list and g_list are required to calculate the input t and g values for preprocess_data, but these variables are not pickled. While this can be resolved by simply removing the pickled data and rerunning the entire calculation, the code should be modified so that the pickled data also work when subsampling_avg=True.
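A minimal sketch of one possible fix is to pickle t_idx_list and g_list together with the preprocessed data, so that a later run can recompute t and g without redoing the preprocessing. The function names, the dictionary keys, and the name data_list below are assumptions made for illustration and do not reflect the actual code in ensemble_md.

```python
import pickle


def save_preprocessed(path, data_list, t_idx_list, g_list):
    """Pickle the preprocessed data together with the subsampling variables."""
    saved = {
        'data_list': data_list,    # hypothetical name for the preprocessed data
        't_idx_list': t_idx_list,  # needed when subsampling_avg=True
        'g_list': g_list,          # needed when subsampling_avg=True
    }
    with open(path, 'wb') as f:
        pickle.dump(saved, f)


def load_preprocessed(path):
    """Return (data_list, t_idx_list, g_list) from a file written above."""
    with open(path, 'rb') as f:
        saved = pickle.load(f)
    return saved['data_list'], saved['t_idx_list'], saved['g_list']
```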
In the current implementation, the weights of EXE replicas in a fixed-weight EEXE simulation are initialized from the same set of weights defined in the template MDP file. For example, say that the weights corresponding to the full alchemical path that covers 4 states include 0, 0.5, 1.5, and 3.5. Then for the first replica that spans 3 states, the weights would be 0, 0.5, and 1.5, and those for the second replica would be 0, 1.0, 3.0 (after shifting the weight of the first state to 0).
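The shifting described above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not the package's implementation: the function name replica_weights and the arguments n_sub (states per replica) and shift (state offset between adjacent replicas) are assumptions chosen for this example.

```python
def replica_weights(weights, n_sub, shift):
    """Split the weights of the full path into per-replica sets.

    Each replica takes n_sub consecutive states, adjacent replicas are
    offset by shift states, and each set is shifted so that the weight
    of its first state is 0.
    """
    n_sim = (len(weights) - n_sub) // shift + 1
    result = []
    for i in range(n_sim):
        sub = weights[i * shift:i * shift + n_sub]
        result.append([w - sub[0] for w in sub])
    return result


print(replica_weights([0.0, 0.5, 1.5, 3.5], n_sub=3, shift=1))
# prints [[0.0, 0.5, 1.5], [0.0, 1.0, 3.0]]
```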
However, the user should have the choice to have different sets of weights for states in different replicas. For example, one could have 0, 0.5, and 1.5 for the first replica, but have 0, 0.5, and 3.0 for the second replica (instead of 0, 1.0, and 3.0). This could be useful for, but is not limited to, EEXE simulations for multiple serial mutations and can possibly be done by inputting multiple MDP files or specifying the list of weights in the YAML file.
This issue is a part of the work in the project EEXE for serial mutations.
As discussed in issues #10 and #13, the flag -multidir in mdrun can be used with MPI-enabled GROMACS, but the overhead could be high due to longer GROMACS start times. However, in cases where the exchange frequency is not too high, so that the introduced overhead is affordable (e.g. in EEXE simulations for multiple serial mutations), enabling MPI-enabled GROMACS might still be useful.
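For reference, an mdrun command that uses -multidir could be assembled as in the sketch below. The helper name, the directory names sim_0, sim_1, ..., the mpirun launcher, and the gmx_mpi executable name are assumptions for illustration (a cluster may use a different launcher or binary name). The check on the number of ranks reflects the GROMACS requirement that the total number of ranks be a multiple of the number of simulations.

```python
def build_multidir_cmd(n_sim, n_proc, gmx='gmx_mpi', prefix='sim_'):
    """Return the argument list for one mdrun call covering n_sim replicas."""
    if n_proc % n_sim != 0:
        raise ValueError('n_proc should be a multiple of n_sim for -multidir')
    # One working directory per replica (assumed naming: sim_0, sim_1, ...)
    dirs = [f'{prefix}{i}' for i in range(n_sim)]
    return ['mpirun', '-np', str(n_proc), gmx, 'mdrun', '-multidir'] + dirs


print(' '.join(build_multidir_cmd(n_sim=4, n_proc=64)))
# prints mpirun -np 64 gmx_mpi mdrun -multidir sim_0 sim_1 sim_2 sim_3
```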
To make our EEXE implementation work with both thread-MPI GROMACS and MPI-enabled GROMACS, we decided to have two CLIs for running EEXE simulations: the original CLI run_EEXE, which works for thread-MPI GROMACS, and the CLI run_EEXE_mpi, which we aim to develop here to work with MPI-enabled GROMACS. These two CLIs will be mostly the same, except that run_EEXE_mpi will not use mpi4py to avoid nested MPI calls. Some functions in ensemble_EXE.py may need to be modified to work with both thread-MPI GROMACS and MPI-enabled GROMACS, or, when this is not possible, functions specifically for MPI-enabled GROMACS will be added.
This issue is a part of the work in the project EEXE for serial mutations.
The information about the simulation is not printed on-the-fly to run_REXEE_log.txt. It is only printed to the file after the simulation is completed. If the simulation is interrupted, the content of run_REXEE_log.txt will be empty. This should be corrected, presumably by modifying Logger in utils.py.
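A likely cause of this behavior is that the text stays in Python's write buffer until the file is closed. A minimal sketch of a tee-like logger that flushes after every write is shown below. It assumes that Logger duplicates STDOUT into a log file; the class name FlushingLogger is hypothetical, and the actual Logger in utils.py may be structured differently.

```python
import sys


class FlushingLogger:
    """Write every message to both the terminal and a log file."""

    def __init__(self, logfile):
        self.terminal = sys.stdout
        self.log = open(logfile, 'a')

    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)
        # Flush Python's buffer so the text reaches the file even if
        # the run is interrupted later
        self.log.flush()

    def flush(self):
        # Required so the object can stand in for sys.stdout
        self.terminal.flush()
        self.log.flush()
```

Assigning an instance to sys.stdout would then make every print call appear in the log file as soon as it is issued.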