luigibonati / mlcolvar Goto Github PK

View Code? Open in Web Editor NEW

89.0 89.0 23.0 99.29 MB

A unified framework for machine learning collective variables for enhanced sampling simulations

License: MIT License

Shell 0.88% Python 99.12%

collective-variables data-driven enhanced-sampling machine-learning python

mlcolvar's People

Contributors

Stargazers

Watchers

Forkers

nicolopedrani yuhaihaiyu lmuellender andrrizzi lgtm-migrator thangckt sperego98 pietronvll pradipchm byun-jinyoung shub-bioinfo souvik-snh bylehn hadriennu charlotte-chaps esenmira jintuzhang wankiwi recisic

mlcolvar's Issues

An issue with deep-TICA on reducing the descriptor set

https://groups.google.com/g/plumed-users/c/LQJiFVrYdzE
We have successfully trained a deep-TICA model using the .rbias, which is suitable for our protein-ligand system. We selected 4536 descriptors, representing the interatomic distances between heavy atoms. However, during simulations with these 4536 descriptors, we observed lower computational efficiency. Similar to the Chignolin Folding case you mentioned in your article "Deep learning the slow modes for rare events sampling," we read your suggestions on reducing the descriptor set.
In your article, you mentioned reducing the number of descriptors to 210 by selecting the most relevant ones through sensitivity analysis of the primary CVs. We referred to your article and code https://colab.research.google.com/drive/1dG0ohT75R-UZAFMf_cbYPNQwBaOsVaAA#scrollTo=05ARhiNhSI_D and encountered some issues during testing. We hope you can provide assistance:
1、We faced issues in the variance calculation part and are unsure if the script is suitable for deep-TICA data. How should we modify it to adapt to deep-TICA data?

compute std to correct for different magnitudes

standardize_inputs = True #@param {type:"boolean"}
if multiply_by_stddev:
if standardize_inputs:
dist2 = (dist - Mean) / Range
else:
dist2 = dist
in_std = np.std(dist2, axis=0)
2、We encountered problems in the weight summation part of the function, specifically, we found that
model.nn[0].weight[:,i].abs().sum().item() throws an error: "TypeError: 'FeedForward' object is not subscriptable." Could you guide us on how to resolve this issue?
3、Could you share the code mentioned in the Chignolin Folding case in the article "Deep learning the slow modes for rare events sampling" applicable to reducing the descriptor set for deep-TICA data?
We appreciate your assistance and look forward to your guidance. Meanwhile, we are sharing our test code with you for a better understanding of our issues.
test-code.zip

codecov not working

local exec of codecov returns correct test coverage, while the one running on CI reports 0%

update doc

install instructions (pip + source)
add contribution page
change license to MIT (change license file and in pyproject)
add references within CVs docstrings

early stopping crashes if not used

Distinguish training_step from evaluate_loss for better inheritance?

Btw, the fact that partial results are hidden inside training_step() seems a common problem for inheritance. Because of this, we need to evaluate the encoder twice in each step here. In the future, we might think of a way to make this easier, like having the CVs implementing a evaluate_loss() method that takes a bunch of variables that currently have scope only within training_step() (e.g., the result of the encoder).

Originally posted by @andrrizzi in #62 (comment)

MultiTaskCV: Save lists of auxiliary losses in ModuleList

Currently they auxiliary loss functions are saved in a normal list, but we could save them in a ModuleList. I think this way if the loss function has trainable parameters (e.g., a decoder for reconstruction losses) they will be optimized as well.

incompatibility between kdepy and latest scipy

CI is failing due to an error in compute_fes, which is thrown by FFTKDE, but originally by scipy.brentq function.

since this started appearing in scipy=1.11.*, i would add a previous version in the requirements until this is fixed in either kdepy/scipy

[nn] add option to take a torch.nn.module to build a nncv

rename forward_blocks in forward_cv

@andrrizzi and I thought this to improve clarity

This is because this method is the one which:

returns the CV
might be overloaded when implementing a new cv, with or without using blocks sequential features

Add feature_names attribute to DictDataset

[docs] add notebooks

Introduction

Getting started
Creating datasets
Customize loss and optimizer

NN-based CVs

Autoencoder
Deep-LDA
Deep-TDA
Deep-TICA

Linear models

LDA
TICA

Customize

Multitask
Preprocessing
Adding a new cv from scratch
Adding a new CV by overload

make nomenclature pep8 compliant

Some things that @andrrizzi suggested to clean up:

Breaking changes:

CV names: Xyz_CV --> XyzCV
normIn --> norm_in (in cvs blocks)
MSE_loss --> mse_loss
TDA_loss --> tda_loss

Internal changes:

TICA member: (reg_c0 -> reg_C_0)
Normalization: members ('Mean','Range')
Statistics: members ('Mean','Std','Min','Max')
Transforms: check all

rename loss/optim options in kwargs

To improve clarity, also within training_step where we combine options with other args from data (e.g. weights)

use fast tensor dataloader in deeplda when dataset is given

LabeledDataset --> TensorDataset
Dataloader --> FastTensorDataloader

Transforms acting on positions

I create a transform branch and move there the transform functions which act directly on the atomic positions, waiting for a more stable API

Problems in compiling torch module in plumed

Pytorch==1.13.1

In the beginning, I start to compile the torch module of plumed with libtorch-cxx11-abi-shared-with-deps-1.13.1+cpu, then I found the error "undefined reference to ‘powf@GLIBC_2.27'".

As it is not wise to update GLIBC, I then try to use libtorch-1.13.1 without ABI, then another error occurred:

"""collect2: error: ld returned 1 exit status
    Makefile:499: recipe for target 'plumed' failed
    make[4]: *** [plumed] Error 1
    make[4]: Leaving directory '/home/jyzha/tmp/plumed-2.9.0/src/lib'
    Makefile:110: recipe for target 'all' failed
    make[3]: *** [all] Error 2
    make[3]: Leaving directory '/home/jyzha/tmp/plumed-2.9.0/src/lib'
    Makefile:8: recipe for target 'lib' failed
    make[2]: *** [lib] Error 2
    make[2]: Leaving directory '/home/jyzha/tmp/plumed-2.9.0/src'
    Makefile:33: recipe for target 'lib' failed
    make[1]: *** [lib] Error 2
    make[1]: Leaving directory '/home/jyzha/tmp/plumed-2.9.0'
    Makefile:21: recipe for target 'all' failed
    make: *** [all] Error 2"""

In fact, I have tried to install GLIBC-2.27 in my own path. However, it throws out a weird error:

"""make -r PARALLELMFLAGS="" -C .. objdir=`pwd` all
    make[1]: Entering directory `/home/jyzha/tmp/glibc-2.27'
    make  subdir=csu -C csu ..=../ subdir_lib
    make[2]: Entering directory `/home/jyzha/tmp/glibc-2.27/csu'
    /usr/bin/install -c -m 644  /home/jyzha/tmp/glibc-2.27/build/cstdlibT
    /usr/bin/install: missing destination file operand after ‘/home/jyzha/tmp/glibc-2.27/build/cstdlibT’
    Try '/usr/bin/install --help' for more information."""

I am looking forward to your advice on installing the torch module. I strongly require this tool in my work and I am also willing to contribute to mlcolvar. Thank you~

[nn] Fix memory issue on GPU

Calculate averages and std dev for input/output normalization at the batch level

Training without validation set crashes

At the moment, the training crashes when DictModule does not split the dataset into training and validation because Lightning always calls val_dataloader().

model.to_torchscript difficult to find in tutorial notebooks

The function model.to_torchscript is difficult to find in tutorial notebooks.
They should be improved.

Support multiple datasets of different lengths

Currently, FastDictionaryLoader support multiple datasets only if they have the same number of samples. We should look into whether it's possible to remove this limitation as it is probably the most common case.

Move all test to test folder + pytest parametrization

Up to now, some tests are at the end of the specific files and some are defined in the test folder.

We should move all of them to the test folder and use whenever needed the pytest parametrize functions

Walker column in utils.io.load_dataframe function and create_dataset_from_files

What is the use of the walker column in utils.io.load_dataframe? Is it needed?

         # check if file is in PLUMED format
        if is_plumed_file(filename):
            df_tmp = plumed_to_pandas(filename)
            df_tmp['walker'] = [i for _ in range(len(df_tmp))]
            df_tmp = df_tmp.iloc[start:stop:stride, :]
            df_list.append( df_tmp )
            
        # else use read_csv with optional kwargs
        else:
            df_tmp = pd.read_csv(filename, **kwargs)
            df_tmp['walker'] = [i for _ in range(len(df_tmp))]
            df_tmp = df_tmp.iloc[start:stop:stride, :]
            df_list.append( df_tmp )

I tried and it seems like it only returns a column of zeros

In case it should be kept, create_dataset_from_files should be modified to automatically exclude that column by default as it does with time and labels. Otherwise, when filter_args = None it loads the (useless) walker column

Helper functions for time-lagged datasets

Helper function to create a time-lagged dataset from a file
reweight time-lagged pairs based on exp(beta*V) and not rescaling time

custom loss and optim

uniform losses signature (args + dict)
add set_optimizer and set_loss_fn to CVutils

new release on pypi

We should make a release to be uploaded on pypi

update documentation (see #74)
add requirements into pyproject.toml
add versions also to requirements.txt?
update installation instructions for colab
code formatted with black
decide which version (major/minor/patch) to release? 0.3.1 or 0.4.0 or 1.0.0alpha?

fix MSE_loss

Change signature from (diff) to (y,y_ref)
Enforce check on the shape of y and y_ref

Notebooks tests failing on MacOs due to exceeded time limit

Some notebook tests fail on MacOs because some cell takes more than 300s to finish.
It happens randomly (sometimes it does sometimes it doesn't) but still it's annoying.

Should we check this in the tutorials and maybe use fake values and comment the real ones?

Implement learning rate scheduler

LR scheduler is already in utils.optim but it is not implemented in NNCV base class

DeepTDA cannot be loaded from checkpoint

Loading a DeepTDA CV from a checkpoint does not work:

Minimal (non)working example:

from lightning.pytorch.callbacks.model_checkpoint import ModelCheckpoint
checkpoint = ModelCheckpoint(save_top_k=1,  monitor="valid_loss")

trainer = pl.Trainer(callbacks=[checkpoint],enable_checkpointing=True)
trainer.fit( model, datamodule )

best_model = DeepTDA.load_from_checkpoint(checkpoint.best_model_path)

given an error in initialization:

File [~/software/mambaforge/envs/mlcolvar/lib/python3.10/site-packages/mlcolvar/cvs/supervised/deeptda.py:67](https://file+.vscode-resource.vscode-cdn.net/home/lbonati%40iit.local/work/simulations/sampl5/OAMe/G2/~/software/mambaforge/envs/mlcolvar/lib/python3.10/site-packages/mlcolvar/cvs/supervised/deeptda.py:67), in DeepTDA.__init__(self, n_states, n_cvs, target_centers, target_sigmas, layers, options, **kwargs)
     35 def __init__(
     36     self,
     37     n_states: int,
   (...)
     43     **kwargs,
     44 ):
     45     """
     46     Define Deep Targeted Discriminant Analysis (Deep-TDA) CV composed by a neural network module.
     47     By default a module standardizing the inputs is also used.
   (...)
     64         Set 'block_name' = None or False to turn off that block
     65     """
---> 67     super().__init__(in_features=layers[0], out_features=layers[-1], **kwargs)
     69     # =======   LOSS  =======
     70     self.loss_fn = TDALoss(
     71         n_states=n_states,
     72         target_centers=target_centers,
     73         target_sigmas=target_sigmas,
     74     )

TypeError: mlcolvar.cvs.cv.BaseCV.__init__() got multiple values for keyword argument 'in_features'

we should also check the other CVs and add regtests for this feature (as of now only regressionCV was tested in this notebook: https://mlcolvar.readthedocs.io/en/stable/notebooks/tutorials/intro_3_loss_optim.html#Model-checkpointing)

When validation set is disabled metrics callbacks are not returned

If the validation set is disabled the metrics are not returned anymore

Metrics are updated at the end of the validation epoch, as implemented in mlcolvar.utils.trainer.MetricsCallback

class MetricsCallback(Callback):

Lightning callback which saves logged metrics into a dictionary.
The metrics are recorded at the end of each validation epoch.

   def __init__(self):
        super().__init__()
        self.metrics = {"epoch": []}

    def on_validation_epoch_end(self, trainer, pl_module):
        metrics = trainer.callback_metrics
        if not trainer.sanity_checking:
            self.metrics["epoch"].append(trainer.current_epoch)
            for key, val in metrics.items():
                val = val.item()
                if key in self.metrics:
                    self.metrics[key].append(val)
                else:
                    self.metrics[key] = [val]

rename pytorch_lightning

pytorch_lightning has been renamed to just lightning

Lightning-AI/pytorch-lightning#16688

[tica] parallelize search for time lagged configurations

allow to specify different input and target data in ae-based cvs

AttributeError: 'DeepLDA_CV' object has no attribute 'loss_train'

Dear Developers,

I try to run a tutorial provide in this link: https://mlcvs.readthedocs.io/en/latest/notebooks/ala2_deeplda.html
After training the model, I try to access the attribute loss_train
Then the code raises error:

... \py39mlcvs\lib\site-packages\torch\nn\modules\module.py:1269, in Module.__getattr__(self, name)
   1267     if name in modules:
   1268         return modules[name]
-> 1269 raise AttributeError("'{}' object has no attribute '{}'".format(
   1270     type(self).__name__, name))

AttributeError: 'DeepLDA_CV' object has no attribute 'loss_train'

May be that attribute has not been implemented yet.

[tica] create time lagged dataset does not accept dataframe

Add lr_scheduler options in BaseCV.configure_optimizers

The implementation of the configure_optimizer method of BaseCV apparently doesn't allow the use of a lr_scheduler.

This could be included in the 'optimizer' options maybe, for example, in a quick and dirty way

def configure_optimizers(self):
        """
        Initialize the optimizer based on self._optimizer_name and self.optimizer_kwargs.

        Returns
        -------
        torch.optim
            Torch optimizer
        """
        lr_scheduler_dict = self.optimizer_kwargs.pop('lr_scheduler', None)
    
        optimizer = getattr(torch.optim, self._optimizer_name)(
            self.parameters(), **self.optimizer_kwargs
        )
        if lr_scheduler_dict is not None:
            lr_scheduler_name = lr_scheduler_dict.pop('scheduler')
            lr_scheduler = {
                'scheduler': lr_scheduler_name(optimizer, **lr_scheduler_dict),
            }
            lr_scheduler_dict['scheduler'] = lr_scheduler_name
            self.optimizer_kwargs['lr_scheduler'] = lr_scheduler_dict
            return [optimizer] , [lr_scheduler]
        else: 
            return optimizer

Bug in building DictModule with old version of PyTorch.

When using old versions of PyTorch (e.g., 1.10), building a mlcolvars.data.DictModule may cause the following error:

raise ValueError("Sum of input lengths does not equal the length of the input dataset!").

And this is caused by the change of the torch.utils.data.random_split method:

random_split In PyTorch 2.1:

def random_split(dataset: Dataset[T], lengths: Sequence[Union[int, float]],
                 generator: Optional[Generator] = default_generator) -> List[Subset[T]]:
    r"""
    Randomly split a dataset into non-overlapping new datasets of given lengths.

    If a list of fractions that sum up to 1 is given,
    the lengths will be computed automatically as
    floor(frac * len(dataset)) for each fraction provided.

    After computing the lengths, if there are any remainders, 1 count will be
    distributed in round-robin fashion to the lengths
    until there are no remainders left.

    Optionally fix the generator for reproducible results, e.g.:

    Example:
        >>> # xdoctest: +SKIP
        >>> generator1 = torch.Generator().manual_seed(42)
        >>> generator2 = torch.Generator().manual_seed(42)
        >>> random_split(range(10), [3, 7], generator=generator1)
        >>> random_split(range(30), [0.3, 0.3, 0.4], generator=generator2)

    Args:
        dataset (Dataset): Dataset to be split
        lengths (sequence): lengths or fractions of splits to be produced
        generator (Generator): Generator used for the random permutation.
    """

random_split In PyTorch 1.10:

def random_split(dataset: Dataset[T], lengths: Sequence[int],
                 generator: Optional[Generator] = default_generator) -> List[Subset[T]]:
    r"""
    Randomly split a dataset into non-overlapping new datasets of given lengths.
    Optionally fix the generator for reproducible results, e.g.:

    >>> random_split(range(10), [3, 7], generator=torch.Generator().manual_seed(42))

    Args:
        dataset (Dataset): Dataset to be split
        lengths (sequence): lengths of splits to be produced
        generator (Generator): Generator used for the random permutation.
    """

This method is invoked by

mlcolvar/mlcolvar/data/datamodule.py

Line 211 in e356f24

def _split(self, dataset):

Apparently, the _split method passes dataset length fractions to the random_split method, but the old random_split method only accepts explicit dataset lengths as parameters. Thus, it may be reasonable to modify the code to pass actual data lengths.

why not a standard form for linear model?

Dear @luigibonati

Can i have a silly question about the linear model implemented in your code at: https://github.com/luigibonati/mlcvs/blob/main/mlcvs/models/linear.py

You used

s = torch.matmul(X - self.b, self.w)

instead of a standard form y = weight * X + bias

Is there any specific purpose for that?

fix loss_options init in BaseCV / child cvs

right now, loss_options is

defined as an empty dict in BaseCV
updated with the values contained in options kwargs in sanitize_options
reset to the default values per class in child class initialization

so we need to change the order between 2 and 3

create general dataloader from file

handle both labeled and unlabeled data

Missing installation info on repo main page

On the main page of the repository, the installation info is missing.

We should add:

Install with pip
Install cloning from GitHub

possible to apply for per-atom descriptors?

Dear Developers.

Do you plan to update MLCVS that can allow to apply for per-atoms descriptors?
I mean the descriptors are computed as per-atom vectors than a global vector for a whole system

Thank you so much

Migrate from lgtm to GitHub code scanning

lgtm.com is being shut down, need to migrate to code scanning feature

[nn] Normalization not working after loading model from checkpoint

missing documentation for DeepTDA

@EnricoTrizio

Preprocessing variables not defined as buffers of model

When preprocessing is included in the model, preprocessing-related variables, which should be buffers for the model, are not passed as they should. This raises problems when we try to do model.to(dtype/device).

We could change the way preprocessing is passed to the model. Instead of as an attribute, we can use a property and define a set_preprocessing method which automatically loads the buffers of the preprocessing module into the model one (and remove them if preprocessing is set to None again).

We should also modify the Transform classes in the transform branch accordingly to have the needed variables saved as buffers, of course.

What do you think @andrrizzi?

Maybe we can also use the new approach to set the in/out_features and the example_input in the CV model in case of pre/postprocessing, maybe

if the rank is equal to the number of features (or larger) it should fall back to least_squares, however, I think that as it is written now it just skips the calculation. Indeed, once it enters the first if statement it does not enter also in the following elif.
even when selecting rank < num features the algorithm throws an error. This is the result of executing the test_reduced_rank_tica function:

Traceback (most recent call last):
  File "/home/[email protected]/work/code/mlcvs/mlcolvar/tests/test_core_stats_tica.py", line 7, in <module>
    test_reduced_rank_tica()
  File "/home/[email protected]/work/code/mlcvs/mlcolvar/core/stats/tica.py", line 190, in test_reduced_rank_tica
    tica.compute([x_t,x_lag],[w_t,w_lag], save_params=True, algorithm='reduced_rank')
  File "/home/[email protected]/work/code/mlcvs/mlcolvar/core/stats/tica.py", line 89, in compute
    evals, evecs = reduced_rank_eig(C_0, C_lag, self.reg_C_0, rank = self.out_features)
  File "/home/[email protected]/work/code/mlcvs/mlcolvar/core/stats/utils.py", line 82, in reduced_rank_eig
    _, idxs = torch.topk(vectors.values, rank)
TypeError: topk(): argument 'input' (position 1) must be Tensor, not builtin_function_or_method

@pietronvll can you check this? thanks!