chemprop / chemprop

Message Passing Neural Networks for Molecule Property Prediction

Home Page: https://chemprop.csail.mit.edu

License: Other

Python 98.74% Dockerfile 0.43% Shell 0.83%

chemistry drug-discovery machine-learning neural-networks

chemprop's Introduction

Chemprop

Chemprop is a repository containing message passing neural networks for molecular property prediction.

Documentation can be found here.

There are tutorial notebooks in the examples/ directory.

Chemprop recently underwent a ground-up rewrite and new major release (v2.0.0). A helpful transition guide from Chemprop v1 to v2 can be found here. This includes a side-by-side comparison of CLI argument options, a list of which arguments will be implemented in later versions of v2, and a list of changes to default hyperparameters.

License: Chemprop is free to use under the MIT License. The Chemprop logo is free to use under CC0 1.0.

References: Please cite the appropriate papers if Chemprop is helpful to your research.

Chemprop was initially described in the papers Analyzing Learned Molecular Representations for Property Prediction for molecules and Machine Learning of Reaction Properties via Learned Representations of the Condensed Graph of Reaction for reactions.
The interpretation functionality (available in v1, but not yet implemented in v2) is based on the paper Multi-Objective Molecule Generation using Interpretable Substructures.
Chemprop now has its own dedicated manuscript that describes and benchmarks it in more detail: Chemprop: A Machine Learning Package for Chemical Property Prediction.
A paper describing and benchmarking the changes in v2.0.0 is forthcoming.

Selected Applications: Chemprop has been successfully used in the following works.

A Deep Learning Approach to Antibiotic Discovery - Cell (2020): Chemprop was used to predict antibiotic activity against E. coli, leading to the discovery of Halicin, a novel antibiotic candidate. Model checkpoints are available on Zenodo.
Discovery of a structural class of antibiotics with explainable deep learning - Nature (2023): Identified a structural class of antibiotics selective against methicillin-resistant S. aureus (MRSA) and vancomycin-resistant enterococci using ensembles of Chemprop models, and explained results using Chemprop's interpret method.
ADMET-AI: A machine learning ADMET platform for evaluation of large-scale chemical libraries: Chemprop was trained on 41 absorption, distribution, metabolism, excretion, and toxicity (ADMET) datasets from the Therapeutics Data Commons. The Chemprop models in ADMET-AI are available both as a web server at admet.ai.greenstonebio.com and as a Python package at github.com/swansonk14/admet_ai.
A more extensive list of successful Chemprop applications is given in our 2023 paper.

Version 1.x

Users who have not yet made the switch to Chemprop v2.0 can refer to the following resources.

v1 Documentation

Documentation of Chemprop v1 is available here. Note that the content of this site is several versions behind the final v1 release (v1.7.1) and does not cover the full scope of features available in chemprop v1.
The v1 README is the best source for documentation on more recently-added features.
Please also see descriptions of all the possible command line arguments in the v1 args.py file.

v1 Tutorials and Examples

Benchmark scripts - scripts from our 2023 paper, providing examples of many features using Chemprop v1.6.1
ACS Fall 2023 Workshop - presentation, interactive demo, exercises on Google Colab with solution key
Google Colab notebook - several examples, intended to be run in Google Colab rather than as a Jupyter notebook on your local machine
nanoHUB tool - a notebook of examples similar to the Colab notebook above, doesn't require any installation
- YouTube video - lecture accompanying nanoHUB tool
These slides provide a Chemprop tutorial and highlight additions as of April 28th, 2020

v1 Known Issues

We have discontinued support for v1 since v2 has been released, but we still appreciate v1 bug reports and will tag them as v1-wontfix so the community can find them easily.


chemprop's Issues

Can't pickle local object/EOF error during predict.py

I am trying to run the predict.py script on a trained model, but am getting an error after loading the data and pretrained parameters. I'm running on Windows 10 with a CPU. I've been able to use previous versions of chemprop with no issues, but I updated to the newest version recently and ran into this one. Thanks

Update - I get the same error when running interpret.py.

Loading training args
Loading data
21it [00:00, ?it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████| 21/21 [00:00<?, ?it/s]
Validating SMILES
Test size = 21
Predicting with an ensemble of 1 models
0%| | 0/1 [00:00<?, ?it/s]Loading pretrained parameter "encoder.encoder.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.W_i.weight".
Loading pretrained parameter "encoder.encoder.W_h.weight".
Loading pretrained parameter "encoder.encoder.W_o.weight".
Loading pretrained parameter "encoder.encoder.W_o.bias".
Loading pretrained parameter "ffn.1.weight".
Loading pretrained parameter "ffn.1.bias".
Loading pretrained parameter "ffn.4.weight".
Loading pretrained parameter "ffn.4.bias".

0%| | 0/1 [00:00<?, ?it/s]
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "predict.py", line 8, in <module>
    make_predictions(args)
  File "C:\chemprop\chemprop\train\make_predictions.py", line 87, in make_predictions
    scaler=scaler
  File "C:\chemprop\chemprop\train\predict.py", line 28, in predict
    for batch in tqdm(data_loader, disable=disable_progress_bar):
  File "\anaconda3\envs\chemprop\lib\site-packages\tqdm\std.py", line 1129, in __iter__
    for obj in iterable:
  File "C:\chemprop\chemprop\data\data.py", line 388, in __iter__
    return super(MoleculeDataLoader, self).__iter__()
  File "\anaconda3\envs\chemprop\lib\site-packages\torch\utils\data\dataloader.py", line 279, in __iter__
    return _MultiProcessingDataLoaderIter(self)
  File "\anaconda3\envs\chemprop\lib\site-packages\torch\utils\data\dataloader.py", line 719, in __init__
    w.start()
  File "\anaconda3\envs\chemprop\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "\anaconda3\envs\chemprop\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "\anaconda3\envs\chemprop\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "\anaconda3\envs\chemprop\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "\anaconda3\envs\chemprop\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'MoleculeDataLoader.__init__.<locals>.construct_molecule_batch'

(chemprop) C:\chemprop>Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "\anaconda3\envs\chemprop\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "\anaconda3\envs\chemprop\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
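
The AttributeError in this report is what Python's pickle module raises when asked to serialize a function defined inside another function. On Windows, multiprocessing starts worker processes with the spawn method, so anything handed to a worker has to be picklable. The sketch below reproduces the failure in isolation and shows one possible remedy, moving the function to module level. The names make_local_collate, collate_batch and can_pickle are hypothetical and are not taken from Chemprop's code.

```python
import pickle


def make_local_collate():
    """Return a function defined inside another function (a 'local object')."""
    def collate(batch):
        return list(batch)
    return collate


def collate_batch(batch):
    """Module-level equivalent; pickle can refer to it by its qualified name."""
    return list(batch)


def can_pickle(obj) -> bool:
    """Report whether pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False


if __name__ == "__main__":
    print(can_pickle(make_local_collate()))  # function defined inside a function
    print(can_pickle(collate_batch))         # function defined at module level
```

Another way to avoid the problem, at the cost of parallel data loading, would be to run the data loader without worker processes, since nothing is pickled when loading happens in the main process.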

[Question] What's the unit of performance?

Hi, thanks for sharing this great project. I saw the results table in the README and wonder what the unit of the MAE in the table is, especially for the QM9 dataset. The QM9 dataset has 19 different labeled targets, including homo, lumo, gap and so on. Is the MAE in the results the average over these targets?

Request to share the trained weights.

Hi!

I was wondering if the authors could share the trained weights for all the datasets on which they benchmarked the DMPNN model. I do not have the GPU compute needed to train on these datasets, and the models I need are ones the authors have already trained. It would be of great benefit if somebody could share them.

Thanks!

Can I turn off the "split_type" flag in train.py to allow model training with all data?

Hi all,

Recently, I have been using train.py to train a predictive model on my data.
python train.py --data_path --dataset_type --save_dir

I know that "--split_type random", "--split_sizes (0.8,0.1,0.1)" and "--num_folds 1" are the defaults. So it seems that the total data is split 8:1:1, and only 80% of the data is used to train the model that is saved for future prediction.

My questions are:

What can I do if I want to use 100% of the data to train the model? For example, can I set split_sizes to (1,0,0) in my training command?
The cross-validation controlled by num_folds seems different from traditional n-fold cross-validation. According to the code, it performs the train/val/test split n times and then outputs the average test score. Am I right?

Also, I wonder whether it will save 5 models if I set num_folds=5. If that's the case, what happens if I use the saved models (from the num_folds=5 run) for prediction? Does it take the average of the predictions from the 5 models? The ensemble size is 1 in the cases above.

Thank you!
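
To make the distinction raised in this question concrete, here is a minimal sketch that contrasts repeated random train/val/test splits (one independent split per fold) with classical k-fold partitioning. It illustrates the two schemes as the question describes them. It is not Chemprop's implementation, and all function names are hypothetical.

```python
import random


def random_split(n, sizes=(0.8, 0.1, 0.1), seed=0):
    """Shuffle indices 0..n-1 and cut them into train/val/test by fraction."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(sizes[0] * n)
    n_val = int(sizes[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def repeated_random_splits(n, num_folds, sizes=(0.8, 0.1, 0.1)):
    """One independent random split per fold; test sets may overlap."""
    return [random_split(n, sizes, seed=fold) for fold in range(num_folds)]


def k_fold_splits(n, k):
    """Classical k-fold: every index lands in exactly one test fold."""
    idx = list(range(n))
    folds = [idx[i::k] for i in range(k)]
    return [
        ([j for f, fold in enumerate(folds) if f != i for j in fold], folds[i])
        for i in range(k)
    ]
```

With repeated random splits, a given molecule can appear in several test sets or in none. With k-fold partitioning, each molecule is tested exactly once.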

The role of "num_folds" in hyperparameter_optimization.py

Hi all,

I am a little bit confused about the "num_folds" parameter in hyperparameter_optimization.py. So I submitted this ticket for better understanding.

For example, the script is:

python hyperparameter_optimization.py --data_path <data_path> --dataset_type --num_iters --config_save_path <config_path> --num_iters <n_iterations> --num_folds <n_folds>

I assume that if num_folds = 1, it does a single 8:1:1 train/val/test split and evaluates num_iters models with different hyperparameter combinations, using the test score on 10% of the data. The validation score on another 10% of the data is used to select the optimal number of epochs.

Similarly, if num_folds = 5, it performs the split 5 times and selects the hyperparameters with the highest average test score across the 5 test sets.

Also, in general, the num_folds setting here is different from classical n-fold cross-validation, which splits the whole dataset into n folds and uses n-1 folds for training and 1 fold for testing.

I wonder if my understanding is correct. I would appreciate any feedback or ideas.

Thank you!
Cheng

Risk of insufficient memory when using huge training set

chemprop/chemprop/data/data.py

Line 61 in 71e5e86

@property

Generating an RDKit molecule object for each datapoint during data loading may use a lot of memory when training on a huge training set. I trained on 5M molecules on a machine with 120 GB of memory, and the memory was insufficient with 8 workers. It would probably help to allow molecules to be generated in the DataLoader instead.
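
One way to picture the suggestion is a datapoint that stores only the SMILES string and builds the molecule object when it is accessed. The sketch below assumes this design. The class LazyDatapoint and the function parse_molecule are hypothetical, and parse_molecule is a stand-in for a real RDKit parsing call.

```python
class LazyDatapoint:
    """Keeps only the SMILES string; builds the molecule object on access."""

    def __init__(self, smiles, builder):
        self.smiles = smiles
        self._builder = builder  # e.g. an RDKit parsing function (assumption)

    @property
    def mol(self):
        # Built on every access and never stored, so resident memory scales
        # with the strings rather than with the parsed molecule objects.
        return self._builder(self.smiles)


build_calls = []


def parse_molecule(smiles):
    """Placeholder for a real parser; records each call for illustration."""
    build_calls.append(smiles)
    return {"smiles": smiles, "n_chars": len(smiles)}


dataset = [LazyDatapoint(s, parse_molecule) for s in ["CCO", "c1ccccc1", "CC(=O)O"]]
```

The trade-off is extra CPU time, since each molecule is parsed again every time it is requested.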

could not convert string to float

Input args

--save_smiles_splits doesn't work if used when specifying smiles and target columns using the new flags --smiles_column & --target_columns in combination with --separate_val_path & --separate_test_path. There is a smiles error that gets thrown during training. A minor bug, nothing serious :-)

Missing schema.sql when installing from PyPi

I followed the Option 1 in Installation and it threw this error:

(graphNN) vinhtq115@Dell-G7-7588:~/PycharmProjects/graphNN$ chemprop_web
Traceback (most recent call last):
  File "/home/vinhtq115/anaconda3/envs/graphNN/bin/chemprop_web", line 8, in <module>
    sys.exit(chemprop_web())
  File "/home/vinhtq115/anaconda3/envs/graphNN/lib/python3.8/site-packages/chemprop/web/run.py", line 39, in chemprop_web
    run_web(args=WebArgs().parse_args())
  File "/home/vinhtq115/anaconda3/envs/graphNN/lib/python3.8/site-packages/chemprop/web/run.py", line 28, in run_web
    db.init_db()
  File "/home/vinhtq115/anaconda3/envs/graphNN/lib/python3.8/site-packages/chemprop/web/app/db.py", line 30, in init_db
    with current_app.open_resource('schema.sql') as f:
  File "/home/vinhtq115/anaconda3/envs/graphNN/lib/python3.8/site-packages/flask/helpers.py", line 1114, in open_resource
    return open(os.path.join(self.root_path, resource), mode)
FileNotFoundError: [Errno 2] No such file or directory: '/home/vinhtq115/anaconda3/envs/graphNN/lib/python3.8/site-packages/chemprop/web/app/schema.sql'

It seems that the file is not in the folder. Executing ls in the /home/vinhtq115/anaconda3/envs/graphNN/lib/python3.8/site-packages/chemprop/web/app/ returns these files and folders:
db.py __init__.py __pycache__ views.py web_checkpoints web_data

Update: I executed chemprop_web again and this time, it showed Flask running but accessing http://127.0.0.1:5000 returns HTTP error 500.

Training Loss printed looks incorrect.

chemprop/chemprop/train/train.py

Line 86 in 71e5e86

loss_avg = loss_sum / iter_count

iter_count is incremented by batch_size on each iteration, so the printed loss_avg appears to be much smaller than the actual training loss.
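
The arithmetic behind this report can be sketched as follows, assuming that loss_sum accumulates the mean loss of each batch (the quoted line shows only the final division, so this is an assumption). The function average_losses is hypothetical.

```python
def average_losses(batch_mean_losses, batch_size):
    """Compare two ways of averaging per-batch mean losses."""
    loss_sum = 0.0
    iter_count = 0   # incremented by batch_size, as described in the issue
    n_batches = 0
    for loss in batch_mean_losses:
        loss_sum += loss
        iter_count += batch_size
        n_batches += 1
    reported = loss_sum / iter_count   # divides by the number of samples
    per_batch = loss_sum / n_batches   # divides by the number of batches
    return reported, per_batch


reported, per_batch = average_losses([0.5, 0.3, 0.4], batch_size=50)
```

Under that assumption, the reported value is smaller than the per-batch average by a factor equal to the batch size.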

modularize MoleculeModel

Hello!

Just wanted to start off by saying fantastic work on this project- it's truly an exciting advancement in molecular property prediction!

I'm writing to ask about the possibility of modularizing the Chemprop model. I'm trying to incorporate the model into some software of my own, but am finding it pretty unnatural to do so. My main issue arises from the need to extensively wrap code you've already written.
ex: The central model in Chemprop, the MoleculeModel class, takes "one" argument, args, of type TrainArgs. Essentially, this is just a Namespace that holds all of the keyword arguments to properly initialize a MoleculeModel, but it ends up obfuscating the arguments required to initialize a MoleculeModel. Additionally, one can't just initialize a TrainArgs object and pass that, because that requires specifying some unnecessary arguments (e.g. 'data_path' and 'target_columns') that are irrelevant to just creating the MoleculeModel itself. I got around this by manually inspecting the MoleculeModel class and searching for what attributes of a TrainArgs object that it accesses during initialization, then cross-referencing these with their default values from TrainArgs to create a wrapped initializer for a MoleculeModel, but I think your code would benefit from directly incorporating such an initializer so that future users like myself won't have to go to such lengths. (See below for a sample implementation of this. Also, please correct me if I've gotten anything wrong!)

from argparse import Namespace
from typing import Optional

from chemprop.models import MoleculeModel  # import path assumed from the v1 package layout


class MPNN:
    """A wrapper for the base chemprop MoleculeModel"""
    def __init__(self, batch_size: int = 50,
                 atom_messages: bool = False, hidden_size: int = 300,
                 bias: bool = False, depth: int = 3, dropout: float = 0.0,
                 undirected: bool = False, features_only: bool = False,
                 use_input_features: bool = False, device: str = 'cpu',
                 features_size: Optional[int] = None, activation: str = 'ReLU',
                 ffn_hidden_size: Optional[int] = None,
                 ffn_num_layers: int = 2, num_tasks: int = 1,
                 metric: str = 'rmse', epochs: int = 30,
                 warmup_epochs: float = 2.0, total_epochs=0,
                 init_lr: float = 1e-4, max_lr: float = 1e-3,
                 final_lr: float = 1e-4):
        # Fall back to the message passing hidden size for the FFN.
        ffn_hidden_size = ffn_hidden_size or hidden_size
        # Only the attributes MoleculeModel reads at initialization are passed;
        # the training-related parameters above are not used here.
        self.model = MoleculeModel(
            Namespace(
                dataset_type='regression', num_tasks=num_tasks,
                atom_messages=atom_messages, hidden_size=hidden_size,
                bias=bias, depth=depth, dropout=dropout,
                undirected=undirected, features_only=features_only,
                use_input_features=use_input_features, device=device,
                features_size=features_size, activation=activation,
                ffn_hidden_size=ffn_hidden_size, ffn_num_layers=ffn_num_layers),
            featurizer=False
        )

edit: I now disagree with the point from my original comment reproduced below

Related to this (this is solely an opinion): what are your thoughts on creating wrapped train() and predict() methods for Chemprop to more easily inline these methods into other software? My motivation for asking is that other types of predictive models, like those in sklearn and tensorflow, offer pretty straightforward syntax for fitting a model based on two parallel sequences of inputs and outputs. Trying to do the same for the Chemprop model is also proving to be not as trivial as one might hope. I think it might be beneficial for others looking to use your model in their own software if you provide these simplified methods as part of your library.

[Question] Multi-classification output?

I've trained a multi-classification model using 4 classes and run predictions with them and now I'm trying to understand the output of the predictions.

The target column is called Class, so after I run the predictions, I get a new column called Class_class_0 where the values look like "[2.7198659724615337e-17, 6.396466081322518e-18, 1.0, 1.50539243293224e-08]"

Are these something like probabilities for the 4 classes? So in this case, the sample should be assigned to the 3rd class label?
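
If the four values are per-class scores such as probabilities (they sum to roughly 1 here, which is consistent with that reading but is not confirmed in this thread), the predicted label is the index of the largest value. A minimal sketch of that lookup, using the cell quoted above and a hypothetical helper predicted_class:

```python
import json

raw = "[2.7198659724615337e-17, 6.396466081322518e-18, 1.0, 1.50539243293224e-08]"


def predicted_class(cell: str) -> int:
    """Parse one prediction cell and return the index of its largest value."""
    scores = json.loads(cell)
    return max(range(len(scores)), key=scores.__getitem__)


index = predicted_class(raw)  # zero-based position of the largest score
```

For the quoted cell the largest value sits at zero-based index 2, which is the third class.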

Can we featurize in parallel and save molgraph features to disk?

My dataset is quite large and it takes many hours to featurize. Can the featurization be done in parallel, and can the featurized data be saved to disk?
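
The general pattern being asked about can be sketched with the standard library alone: map a featurization function over the inputs with a worker pool, then write the result to disk for reuse. This is a sketch under stated assumptions, not Chemprop's featurizer. The function featurize is a toy placeholder, and featurize_all, save_features and load_features are hypothetical names.

```python
import pickle
from concurrent.futures import ThreadPoolExecutor


def featurize(smiles):
    """Toy stand-in for a real featurizer: counts a few characters."""
    return [smiles.count("C"), smiles.count("O"), smiles.count("N"), len(smiles)]


def featurize_all(smiles_list, executor_cls=ThreadPoolExecutor, workers=4):
    """Map featurize over the inputs with a pool, preserving input order."""
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(featurize, smiles_list))


def save_features(features, path):
    """Write the computed features to disk so they can be reused."""
    with open(path, "wb") as f:
        pickle.dump(features, f)


def load_features(path):
    """Read features written by save_features."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

For CPU-bound featurization, concurrent.futures.ProcessPoolExecutor can be passed as executor_cls. That requires the mapped function to live at module level, and on Windows the calling code must sit under an `if __name__ == "__main__":` guard.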

Unable to install on Windows

I am unable to install on Windows. As far as I can see, this is because the web interface packages are not available for Windows. Is it possible to make the web interface an optional 'add-on' package, so that the core of ChemProp would be installable on Windows machines?

predict.py TypeError

Hi, excellent model all together, amazing work.

I get an error after I train the model (with no issues) and attempt to predict on a new set of SMILES.

I tried changing the model and the SMILES used for prediction, but I keep running into the same issue. Here is the full output:

python predict.py --test_path rep_drugs.csv --checkpoint_dir save --preds_path rep_preds.csv
Loading training args
Loading data
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5794/5794 [00:01<00:00, 5280.02it/s]
Validating SMILES
Test size = 5,794
Predicting with an ensemble of 2 models
0%| | 0/2 [00:00<?, ?it/s]Loading pretrained parameter "encoder.encoder.cached_zero_vector".
Loading pretrained parameter "encoder.encoder.W_i.weight".
Loading pretrained parameter "encoder.encoder.W_h.weight".
Loading pretrained parameter "encoder.encoder.W_o.weight".
Loading pretrained parameter "encoder.encoder.W_o.bias".
Loading pretrained parameter "ffn.1.weight".
Loading pretrained parameter "ffn.1.bias".
Loading pretrained parameter "ffn.4.weight".
Loading pretrained parameter "ffn.4.bias".
0%| | 0/116 [00:00<?, ?it/s]
0%| | 0/2 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "predict.py", line 8, in <module>
    make_predictions(args)
  File "/home/user/Desktop/chemprop-master/chemprop/train/make_predictions.py", line 72, in make_predictions
    scaler=scaler
  File "/home/user/Desktop/chemprop-master/chemprop/train/predict.py", line 39, in predict
    batch_preds = model(batch, features_batch)
  File "/home/user/anaconda3/envs/chemprop/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/Desktop/chemprop-master/chemprop/models/model.py", line 88, in forward
    output = self.ffn(self.encoder(*input))
  File "/home/user/anaconda3/envs/chemprop/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/user/Desktop/chemprop-master/chemprop/models/mpn.py", line 187, in forward
    output = self.encoder.forward(batch, features_batch)
  File "/home/user/Desktop/chemprop-master/chemprop/models/mpn.py", line 73, in forward
    features_batch = torch.from_numpy(np.stack(features_batch)).float()
  File "<__array_function__ internals>", line 6, in stack
TypeError: dispatcher for __array_function__ did not return an iterable

Not sure what the issue is.

subprocess.CalledProcessError: Command '['git', 'remote', 'get-url', 'origin']'

Hello,
I just reinstalled the new version of chemprop following the README. With this new version, I encountered this issue when I ran:
python train.py --data_path data/tox21.csv --dataset_type classification --save_dir tox21_checkpoints

The full error messages are:

Command line
python train.py --data_path data/tox21.csv --dataset_type classification --save_dir tox21_checkpoints
Args
{'activation': 'ReLU',
'atom_messages': False,
'batch_size': 50,
'bias': False,
'cache_cutoff': 10000,
'checkpoint_dir': None,
'checkpoint_path': None,
'checkpoint_paths': None,
'class_balance': False,
'config_path': None,
'crossval_index_dir': None,
'crossval_index_file': None,
'crossval_index_sets': None,
'cuda': False,
'data_path': 'data/tox21.csv',
'dataset_type': 'classification',
'depth': 3,
'device': device(type='cpu'),
'dropout': 0.0,
'ensemble_size': 1,
'epochs': 30,
'features_generator': None,
'features_only': False,
'features_path': None,
'features_scaling': True,
'features_size': None,
'ffn_hidden_size': 300,
'ffn_num_layers': 2,
'final_lr': 0.0001,
'folds_file': None,
'gpu': None,
'hidden_size': 300,
'init_lr': 0.0001,
'log_frequency': 10,
'max_data_size': None,
'max_lr': 0.001,
'metric': 'auc',
'minimize_score': False,
'multiclass_num_classes': 3,
'no_cuda': False,
'no_features_scaling': False,
'num_folds': 1,
'num_lrs': 1,
'num_tasks': None,
'num_workers': 8,
'pytorch_seed': 0,
'quiet': False,
'save_dir': 'tox21_checkpoints/fold_0',
'save_smiles_splits': False,
'seed': 0,
'separate_test_features_path': None,
'separate_test_path': None,
'separate_val_features_path': None,
'separate_val_path': None,
'show_individual_scores': False,
'smiles_column': None,
'split_sizes': (0.8, 0.1, 0.1),
'split_type': 'random',
'target_columns': None,
'task_names': None,
'test': False,
'test_fold_index': None,
'train_data_size': None,
'undirected': False,
'use_input_features': False,
'val_fold_index': None,
'warmup_epochs': 2.0}
Traceback (most recent call last):
  File "train.py", line 11, in <module>
    cross_validate(args, logger)
  File "/opt/chemprop/chemprop/train/cross_validate.py", line 29, in cross_validate
    model_scores = run_training(args, logger)
  File "/opt/chemprop/chemprop/train/run_training.py", line 48, in run_training
    args.save(os.path.join(args.save_dir, 'args.json'))
  File "/opt/miniconda3/envs/chemprop/lib/python3.6/site-packages/tap/tap.py", line 445, in save
    json.dump(self._log_all(), f, indent=4, sort_keys=True, cls=PythonObjectEncoder)
  File "/opt/miniconda3/envs/chemprop/lib/python3.6/site-packages/tap/tap.py", line 277, in _log_all
    arg_log['reproducibility'] = self.get_reproducibility_info()
  File "/opt/miniconda3/envs/chemprop/lib/python3.6/site-packages/tap/tap.py", line 266, in get_reproducibility_info
    reproducibility['git_url'] = get_git_url(commit_hash=True)
  File "/opt/miniconda3/envs/chemprop/lib/python3.6/site-packages/tap/utils.py", line 65, in get_git_url
    url = check_output(['git', 'remote', 'get-url', 'origin'])
  File "/opt/miniconda3/envs/chemprop/lib/python3.6/site-packages/tap/utils.py", line 33, in check_output
    output = subprocess.check_output(command, stderr=devnull).decode('utf-8').strip()
  File "/opt/miniconda3/envs/chemprop/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/opt/miniconda3/envs/chemprop/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['git', 'remote', 'get-url', 'origin']' returned non-zero exit status 129.

Do you have any suggestions for getting rid of this error? BTW, I didn't have this issue when I installed and used the earlier chemprop version in March.

Other information if needed:
git config --get remote.origin.url, it shows:
https://github.com/chemprop/chemprop.git
git remote -v, it shows:
origin https://github.com/chemprop/chemprop.git (fetch)
origin https://github.com/chemprop/chemprop.git (push)
git version 1.8.3.1

Thank you!
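
The reported git version (1.8.3.1) predates the `git remote get-url` subcommand, which was added in Git 2.7.0. The traceback shows the failing call coming from the tap package rather than from Chemprop itself. The sketch below shows one possible fallback that uses the `git config --get remote.origin.url` query quoted above, which works on the reporter's machine. It is a hypothetical helper, not tap's actual code, and the injectable run parameter is there only to make it testable.

```python
import subprocess


def get_origin_url(run=subprocess.check_output):
    """Return the URL of the 'origin' remote, tolerating older git versions."""
    try:
        out = run(["git", "remote", "get-url", "origin"],
                  stderr=subprocess.DEVNULL)
    except subprocess.CalledProcessError:
        # Older git lacks the 'get-url' subcommand; read the config key instead.
        out = run(["git", "config", "--get", "remote.origin.url"],
                  stderr=subprocess.DEVNULL)
    return out.decode("utf-8").strip()
```

Upgrading git to 2.7 or later would remove the need for any fallback.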

transfer learning needed

Can we add transfer learning, which would allow loading a pre-trained model and training it on a new dataset?

[ERROR] subprocess.CalledProcessError: Command '['git', 'remote', 'get-url', 'origin']' returned non-zero exit status 129.

I am getting some error related to git with the latest changes to the master branch. Full trace back below

(chemprop) nlim@u18:~/Github/chemprop$ python train.py --data_path data/tox21.csv --dataset_type classification --save_dir tox21_checkpoints
Fold 0
Command line
python train.py --data_path data/tox21.csv --dataset_type classification --save_dir tox21_checkpoints
Args
{'activation': 'ReLU',
 'atom_messages': False,
 'batch_size': 50,
 'bias': False,
 'cache_cutoff': 10000,
 'checkpoint_dir': None,
 'checkpoint_path': None,
 'checkpoint_paths': None,
 'class_balance': False,
 'config_path': None,
 'crossval_index_dir': None,
 'crossval_index_file': None,
 'crossval_index_sets': None,
 'cuda': True,
 'data_path': 'data/tox21.csv',
 'dataset_type': 'classification',
 'depth': 3,
 'device': device(type='cuda'),
 'dropout': 0.0,
 'ensemble_size': 1,
 'epochs': 30,
 'features_generator': None,
 'features_only': False,
 'features_path': None,
 'features_scaling': True,
 'features_size': None,
 'ffn_hidden_size': 300,
 'ffn_num_layers': 2,
 'final_lr': 0.0001,
 'folds_file': None,
 'gpu': None,
 'hidden_size': 300,
 'init_lr': 0.0001,
 'log_frequency': 10,
 'max_data_size': None,
 'max_lr': 0.001,
 'metric': 'auc',
 'minimize_score': False,
 'multiclass_num_classes': 3,
 'no_cuda': False,
 'no_features_scaling': False,
 'num_folds': 1,
 'num_lrs': 1,
 'num_tasks': None,
 'num_workers': 4,
 'output_size': None,
 'pytorch_seed': 0,
 'quiet': False,
 'save_dir': 'tox21_checkpoints/fold_0',
 'save_smiles_splits': False,
 'seed': 0,
 'separate_test_features_path': None,
 'separate_test_path': None,
 'separate_val_features_path': None,
 'separate_val_path': None,
 'show_individual_scores': False,
 'smiles_column': None,
 'split_sizes': (0.8, 0.1, 0.1),
 'split_type': 'random',
 'target_columns': None,
 'task_names': None,
 'test': False,
 'test_fold_index': None,
 'train_data_size': None,
 'undirected': False,
 'use_input_features': False,
 'val_fold_index': None,
 'warmup_epochs': 2.0}
error: Unknown subcommand: get-url
usage: git remote [-v | --verbose]
   or: git remote add [-t <branch>] [-m <master>] [-f] [--tags|--no-tags] [--mirror=<fetch|push>] <name> <url>
   or: git remote rename <old> <new>
   or: git remote remove <name>
   or: git remote set-head <name> (-a | -d | <branch>)
   or: git remote [-v | --verbose] show [-n] <name>
   or: git remote prune [-n | --dry-run] <name>
   or: git remote [-v | --verbose] update [-p | --prune] [(<group> | <remote>)...]
   or: git remote set-branches [--add] <name> <branch>...
   or: git remote set-url [--push] <name> <newurl> [<oldurl>]
   or: git remote set-url --add <name> <newurl>
   or: git remote set-url --delete <name> <url>

    -v, --verbose         be verbose; must be placed before a subcommand

Traceback (most recent call last):
  File "train.py", line 11, in <module>
    cross_validate(args, logger)
  File "/data/nlim/Github/chemprop/chemprop/train/cross_validate.py", line 29, in cross_validate
    model_scores = run_training(args, logger)
  File "/data/nlim/Github/chemprop/chemprop/train/run_training.py", line 48, in run_training
    args.save(os.path.join(args.save_dir, 'args.json'))
  File "/data/nlim/anaconda3/envs/chemprop/lib/python3.7/site-packages/tap/tap.py", line 445, in save
    json.dump(self._log_all(), f, indent=4, sort_keys=True, cls=PythonObjectEncoder)
  File "/data/nlim/anaconda3/envs/chemprop/lib/python3.7/site-packages/tap/tap.py", line 277, in _log_all
    arg_log['reproducibility'] = self.get_reproducibility_info()
  File "/data/nlim/anaconda3/envs/chemprop/lib/python3.7/site-packages/tap/tap.py", line 266, in get_reproducibility_info
    reproducibility['git_url'] = get_git_url(commit_hash=True)
  File "/data/nlim/anaconda3/envs/chemprop/lib/python3.7/site-packages/tap/utils.py", line 61, in get_git_url
    url = check_output(['git', 'remote', 'get-url', 'origin'])
  File "/data/nlim/anaconda3/envs/chemprop/lib/python3.7/site-packages/tap/utils.py", line 30, in check_output
    return subprocess.check_output(command).decode('utf-8').strip()
  File "/data/nlim/anaconda3/envs/chemprop/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/data/nlim/anaconda3/envs/chemprop/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['git', 'remote', 'get-url', 'origin']' returned non-zero exit status 129.

[Question] Train/Validation/Test Splits

When I provide a single CSV file containing all my molecules to chemprop_train --data_path and chemprop does the train/validation/test split, does chemprop save the identities which belong to the validation/test sets such that I can use the same file for chemprop_predict --test_path ? Specifically, I'm wondering if I have to remove molecules which appeared in the training dataset from my CSV file before passing it to chemprop_predict --test_path
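
The args dumps earlier on this page list a save_smiles_splits option, which suggests that split membership can be written out, although this thread does not confirm its behavior. Given a list of training SMILES from any source, removing those molecules from the full CSV before prediction could look like the sketch below. The column name smiles, the helper drop_training_rows and the sample data are assumptions for illustration, and matching is by exact string with no canonicalization.

```python
import csv
import io


def drop_training_rows(all_rows, train_smiles, smiles_key="smiles"):
    """Keep only rows whose SMILES string is not in the training set."""
    seen = set(train_smiles)
    return [row for row in all_rows if row[smiles_key] not in seen]


# Made-up data standing in for the full CSV passed to training.
full_csv = "smiles,target\nCCO,0.1\nCCN,0.2\nCCC,0.3\n"
rows = list(csv.DictReader(io.StringIO(full_csv)))
held_out = drop_training_rows(rows, train_smiles=["CCO", "CCC"])
```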

[Question] Uncertainty in predicted values

Hello, just a quick question. Is it possible to get the uncertainty in the model's predicted values?
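
One common proxy for predictive uncertainty is the spread of predictions across an ensemble of models. The logs elsewhere on this page show Chemprop predicting with ensembles, but whether it reports this spread is not stated here. The sketch below computes the mean and population standard deviation per molecule from made-up predictions. The function ensemble_summary is hypothetical.

```python
from statistics import mean, pstdev


def ensemble_summary(predictions_per_model):
    """Return (mean, spread) per molecule across models.

    predictions_per_model[m][i] is model m's prediction for molecule i.
    """
    per_molecule = list(zip(*predictions_per_model))
    return [(mean(p), pstdev(p)) for p in per_molecule]


# Three models, two molecules (illustrative numbers only).
summary = ensemble_summary([[1.0, 2.0], [3.0, 2.0], [2.0, 2.0]])
```

The spread measures only disagreement between ensemble members. It is not a calibrated error estimate.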

[Bug] Issue with hyperparameter_optimization.py script due to typed-argument-parser version

It seems that a recent update to the repo has broken hyperparameter optimization. We've tried running the script in two ways: calculating features first with scripts/save_features.py, and skipping that step. The two commands are as follows:

python scripts/save_features.py --data_path data/data/caco2_regression.csv --features_generator rdkit_2d_normalized --save_path features/caco2.npz --sequential
python hyperparameter_optimization.py --data_path data/caco2_regression.csv --config_save_path configs/caco2.json --quiet --split_type scaffold_balanced --features_path features/caco2.npz --no_features_scaling --num_iters 20 --num_folds 3 --dataset_type regression

In either case, hyperparameter_optimization.py gives the following error:

line 62, in create_ffn
first_linear_dim += args.features_size
TypeError: unsupported operand type(s) for +=: 'int' and 'NoneType'

[Question] neither output nor error reported when interpreting model prediction

Hello,

I've run interpret.py on a Windows computer to interpret the model's predictions, but I got no output and no error was reported while the script was running.
I think there is enough RAM because there were only 100 SMILES strings in the file.
What might be the reasons for that?

Thanks in advance.

Broken link to web app

Hi,
The link to the web prediction interface seems to be broken: chemprop.csail.mit.edu

How to add custom features in predict process?

I'm trying to add custom features in my study. During training it's easy to add
--features_path ../data/features.csv
but I'm a little puzzled about how to add custom features during prediction. Should I pass the same features file that I used for training, or do I need to generate a new features file that corresponds to the prediction dataset rather than the training dataset?
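
Whichever file is passed, a quick sanity check is that it lines up with the molecules being predicted. The sketch below assumes that a features file must contain one row per molecule, in the same order as the data file; that assumption is not confirmed in this thread, and the function name is hypothetical.

```python
def check_features_align(smiles, features):
    """Raise if the feature rows do not line up one-to-one with the molecules.

    Returns the number of features per molecule when the check passes.
    """
    if len(features) != len(smiles):
        raise ValueError(
            f"{len(features)} feature rows for {len(smiles)} molecules"
        )
    widths = {len(row) for row in features}
    if len(widths) > 1:
        raise ValueError(f"feature rows have differing lengths: {sorted(widths)}")
    return widths.pop() if widths else 0


predict_smiles = ["CCO", "CCN", "c1ccccc1"]
predict_features = [[0.1, 1.0], [0.2, 0.0], [0.9, 3.0]]
print(check_features_align(predict_smiles, predict_features))
```

Under that assumption, the training features file would fail this check against a prediction set of a different size.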

Error: Cannot allocate memory

I've been trying to train a model with a dataset size of approximately 1.4 million entries (attached)
full_smi+act-fix.csv.tar.gz

Running using the GPU version of pytorch, I ran the command python train.py --data_path DUDE/full_smi+act-fix.csv --dataset_type regression --save_dir DUDE/full-gpu_checkpoints --seed 0 --use_compound_names --split_type random --num_folds=3 --quiet --batch_size 5000

But this results in the error below:

  File "train.py", line 11, in <module>
    cross_validate(args, logger)
  File "/data/nlim/Github/chemprop/chemprop/train/cross_validate.py", line 29, in cross_validate
    model_scores = run_training(args, logger)
  File "/data/nlim/Github/chemprop/chemprop/train/run_training.py", line 196, in run_training
    writer=writer
  File "/data/nlim/Github/chemprop/chemprop/train/train.py", line 72, in train
    preds = model(batch, features_batch)
  File "/data/nlim/anaconda3/envs/chemprop/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/nlim/Github/chemprop/chemprop/models/model.py", line 88, in forward
    output = self.ffn(self.encoder(*input))
  File "/data/nlim/anaconda3/envs/chemprop/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/nlim/Github/chemprop/chemprop/models/mpn.py", line 185, in forward
    batch = mol2graph(batch, self.args)
  File "/data/nlim/Github/chemprop/chemprop/features/featurization.py", line 319, in mol2graph
    return BatchMolGraph(mol_graphs, args)
  File "/data/nlim/Github/chemprop/chemprop/features/featurization.py", line 251, in __init__
    self.f_bonds = torch.FloatTensor(f_bonds)
RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 181230420 bytes. Error code 12 (Cannot allocate memory)

Does this indicate that I am running out of memory?
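
For scale, the failed request can be converted into more familiar units. The byte count comes from the error message, and `torch.FloatTensor` holds 32-bit floats. The bond-feature width of 147 is an assumption, recalled as chemprop's default and not stated in the post.

```python
FLOAT32_BYTES = 4  # torch.FloatTensor stores 32-bit floats

requested_bytes = 181_230_420  # taken from the error message
elements = requested_bytes // FLOAT32_BYTES
mebibytes = requested_bytes / 1024**2

# Assumption: 147 features per directed bond (recalled chemprop default,
# not stated in the post). The element count divides evenly by it.
ASSUMED_BOND_FDIM = 147
rows, remainder = divmod(elements, ASSUMED_BOND_FDIM)

print(f"{elements} float32 values, {mebibytes:.1f} MiB")
print(f"{rows} rows of {ASSUMED_BOND_FDIM} features, remainder {remainder}")
```

The single request is only about 173 MiB, so its failure suggests that host RAM was already nearly exhausted when this batch was built. The tensor itself is not unusually large.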

[Question] run hyperopt in parallel over multiple GPUS?

Hello,

First of all, thanks so much for chemprop. I'm an absolute newbie in statistical learning but your software is so easy to run and to understand the output. My background is entirely on biochemistry so I really appreciate the verbosity and the ease of use.

Now for my actual question: I have access to an HPC cluster with 8 GPU nodes (Tesla V100) and I want to run hyperparameter optimization in parallel. Is that something that (a) makes sense and (b) is possible? If so, is there an implementation that I can port, or can you point me to where I should look to implement it myself? When I submit a hyperopt job to our Slurm queue, it runs sequentially on one GPU even if I change the number of workers (perhaps I misunderstood what this parameter does).
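
One generic way to use several GPUs without touching the optimizer is to run independent trials as separate processes, each pinned to one device through the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below only plans the assignment and builds the per-process environment. Both helper names are hypothetical, and nothing here is part of chemprop or hyperopt.

```python
import os


def assign_trials_to_gpus(trial_ids, num_gpus):
    """Round-robin trial ids over GPU indices; returns {gpu_index: [trial ids]}."""
    plan = {gpu: [] for gpu in range(num_gpus)}
    for position, trial in enumerate(trial_ids):
        plan[position % num_gpus].append(trial)
    return plan


def env_for_gpu(gpu_index, base_env=None):
    """Copy the environment and expose a single GPU to the child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


plan = assign_trials_to_gpus(trial_ids=list(range(20)), num_gpus=8)
for gpu, trials in plan.items():
    print(gpu, trials)
```

Trials launched this way do not share results while running, so the scheme behaves like a parallel random search rather than a sequential model-based search.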

Thanks again!

use atom-wise features to train the model

Hi,

Can chemprop take atomic properties of heavy atoms (such as atomic charge) as extra features when training a model? Thank you.

Why we need index 0 as zero-padding for the tensors of atom features and bond features?

Just wondering: why does the code in featurization.py (lines 203~205) need index 0 as zero-padding? Is there any specific case (or molecule) that would crash if we removed the zero-padding from the atom-feature and bond-feature tensors?

Thank you so much!
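
A plausible reading, offered as an assumption and not as the maintainers' answer: atoms have different numbers of neighbours, so the per-atom index lists are padded to a common length with 0. Reserving row 0 as an all-zero vector means those padded entries contribute nothing when messages are gathered and summed. A plain-Python sketch of that idea, with hypothetical names:

```python
def sum_incoming(bond_messages, a2b):
    """Sum the messages of each atom's incoming bonds.

    bond_messages[0] is the all-zero padding row; real bonds start at index 1.
    a2b holds one equal-length index list per atom, padded with 0.
    """
    hidden = len(bond_messages[0])
    totals = []
    for incoming in a2b:
        total = [0.0] * hidden
        for bond_index in incoming:
            for k in range(hidden):
                total[k] += bond_messages[bond_index][k]
        totals.append(total)
    return totals


bond_messages = [
    [0.0, 0.0],  # index 0: padding row
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
# Atom 0 has two incoming bonds; atom 1 has one, so its list is padded with 0.
a2b = [[1, 2], [3, 0]]
print(sum_incoming(bond_messages, a2b))
```

Without the reserved zero row, a padded index would point at a real bond and silently add its message to the wrong atom.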

Code fails to run when executed outside of install dir

[ERROR] subprocess.CalledProcessError: Command '['git', 'remote', 'get-url', 'origin']' returned non-zero exit status 129.

This problem persists for me even when I run the code from the install directory. Others have mentioned that running the code from within that directory fixes it, but for me it failed with the same error. There is a simple workaround: commenting out line 46 of chemprop/chemprop/train/run_training.py lets the code run to completion with no errors.

41 # Print args
42 debug('Args')
43 debug(args)
44
45 # Save args Comment this out
46 #args.save(os.path.join(args.save_dir, 'args.json'))
47
48 # Set pytorch seed for random initial weights
49 torch.manual_seed(args.pytorch_seed)
50

Running the git config command within the install dir does return the repo correctly.
[userid@login chemprop]$ git config --get remote.origin.url
https://github.com/chemprop/chemprop.git

Typically, you would not run the code from the directory it is installed in and would instead run it from a separate model-build directory, so it seems a fix is needed.

Full trace below:

Traceback (most recent call last):
File "/home/userid/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/tap/utils.py", line 66, in get_git_url
url = check_output(['git', 'remote', 'get-url', 'origin'])
File "/home/userid/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/tap/utils.py", line 33, in check_output
output = subprocess.check_output(command, stderr=devnull).decode('utf-8').strip()
File "/home/userid/.conda/envs/my-rdkit-env/lib/python3.6/subprocess.py", line 336, in check_output
**kwargs).stdout
File "/home/userid/.conda/envs/my-rdkit-env/lib/python3.6/subprocess.py", line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['git', 'remote', 'get-url', 'origin']' returned non-zero exit status 129.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "train.py", line 7, in <module>
chemprop_train()
File "/hpc/scratch/userid/builds/chemprop/chemprop/utils.py", line 376, in wrap
result = func(*args, **kwargs)
File "/hpc/scratch/userid/builds/chemprop/chemprop/train/cross_validate.py", line 95, in chemprop_train
cross_validate(args=args, logger=logger)
File "/hpc/scratch/userid/builds/chemprop/chemprop/train/cross_validate.py", line 48, in cross_validate
model_scores = run_training(args, logger)
File "/hpc/scratch/userid/builds/chemprop/chemprop/train/run_training.py", line 47, in run_training
args.save(os.path.join(args.save_dir, 'args.json'))
File "/home/userid/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/tap/tap.py", line 501, in save
args = self._log_all() if with_reproducibility else self.as_dict()
File "/home/userid/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/tap/tap.py", line 292, in _log_all
arg_log['reproducibility'] = self.get_reproducibility_info()
File "/home/userid/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/tap/tap.py", line 281, in get_reproducibility_info
reproducibility['git_url'] = get_git_url(commit_hash=True)
File "/home/userid/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/tap/utils.py", line 69, in get_git_url
url = check_output(['git' 'config' '--get' 'remote.origin.url'])
File "/home/userid/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/tap/utils.py", line 33, in check_output
output = subprocess.check_output(command, stderr=devnull).decode('utf-8').strip()
File "/home/userid/.conda/envs/my-rdkit-env/lib/python3.6/subprocess.py", line 336, in check_output
**kwargs).stdout
File "/home/userid/.conda/envs/my-rdkit-env/lib/python3.6/subprocess.py", line 403, in run
with Popen(*popenargs, **kwargs) as process:
File "/home/userid/.conda/envs/my-rdkit-env/lib/python3.6/subprocess.py", line 709, in __init__
restore_signals, start_new_session)
File "/home/userid/.conda/envs/my-rdkit-env/lib/python3.6/subprocess.py", line 1344, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'gitconfig--getremote.origin.url': 'gitconfig--getremote.origin.url'
(my-rdkit-env) [userid@login1 chemprop]$

Originally posted by @JamesALumley in #28 (comment)
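
The second traceback shows the fallback call written without commas between the list items. Python concatenates adjacent string literals, so that list holds a single string, which matches the 'gitconfig--getremote.origin.url' name in the FileNotFoundError. A small demonstration, assuming the line shown in the traceback is what the installed tap version contains:

```python
# Adjacent string literals are joined at compile time, so this is a
# one-element list, not a four-element command.
broken = ['git' 'config' '--get' 'remote.origin.url']
intended = ['git', 'config', '--get', 'remote.origin.url']

print(broken)
print(len(broken), len(intended))
```

subprocess then treats that single string as the name of the executable to run, which does not exist.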

interpret with --features_path

I run chemprop with --features_path to incorporate my own features into the model. However, when I run interpret.py, it asks me to add --features_generator because it detects that the model uses features in addition to the SMILES. I tried passing --features_path to interpret.py, since my features do not come from a features generator, but it does not seem to accept --features_path.
Any suggestions? Thanks.

Error while running the code

Hi, I get the following error while running the code in the chemprop environment:

(chemprop) C:\Users\1901566.admin\Downloads\chemprop-master>python web/run.py
Traceback (most recent call last):
File "web/run.py", line 10, in <module>
from app import app, db
File "C:\Users\1901566.admin\Downloads\chemprop-master\web\app\__init__.py", line 6, in <module>
app.config.from_object('config')
File "C:\Users\1901566.admin\.conda\envs\chemprop\lib\site-packages\flask\config.py", line 174, in from_object
obj = import_string(obj)
File "C:\Users\1901566.admin\.conda\envs\chemprop\lib\site-packages\werkzeug\utils.py", line 568, in import_string
__import__(import_name)
File "C:\Users\1901566.admin\Downloads\chemprop-master\web\config.py", line 9, in <module>
import torch
File "C:\Users\1901566.admin\.conda\envs\chemprop\lib\site-packages\torch\__init__.py", line 81, in <module>
ctypes.CDLL(dll)
File "C:\Users\1901566.admin\.conda\envs\chemprop\lib\ctypes\__init__.py", line 348, in __init__
self._handle = _dlopen(self._name, mode)
OSError: [WinError 126] The specified module could not be found

Could anyone please help me with it? Thank you

RuntimeError: DataLoader worker (pid 12130) is killed by signal: Killed.

I have a dataset of about 700K molecules, which appears to load just fine, but by the third epoch I receive the following error:

Traceback (most recent call last):
  File "/data/nlim/anaconda3/envs/deepchem/bin/chemprop_train", line 11, in <module>
    load_entry_point('chemprop', 'console_scripts', 'chemprop_train')()
  File "/data/nlim/Github/chemprop/chemprop/train/cross_validate.py", line 78, in chemprop_train
    cross_validate(args, logger)
  File "/data/nlim/Github/chemprop/chemprop/train/cross_validate.py", line 35, in cross_validate
    model_scores = run_training(args, logger)
  File "/data/nlim/Github/chemprop/chemprop/train/run_training.py", line 203, in run_training
    writer=writer
  File "/data/nlim/Github/chemprop/chemprop/train/train.py", line 70, in train
    loss.backward()
  File "/data/nlim/anaconda3/envs/deepchem/lib/python3.6/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/data/nlim/anaconda3/envs/deepchem/lib/python3.6/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/data/nlim/anaconda3/envs/deepchem/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 12130) is killed by signal: Killed.

I am using chemprop 0.0.4 and torch 1.5.1 with CUDA. Full conda environment below

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_py-xgboost-mutex         2.0                       cpu_0    conda-forge
_tflow_select             2.3.0                       mkl  
absl-py                   0.9.0                    py36_0    conda-forge
alabaster                 0.7.12                   py36_0  
arrow-cpp                 0.15.1           py36h38a5dc4_3    conda-forge
astor                     0.8.1                    pypi_0    pypi
attrs                     19.3.0                     py_0  
babel                     2.8.0                      py_0  
backcall                  0.1.0                    py36_0  
biopython                 1.76             py36h516909a_0    conda-forge
blas                      1.0                    openblas  
bleach                    3.1.0                    py36_0  
blinker                   1.4                        py_1    conda-forge
blosc                     1.17.1               he1b5a44_0    conda-forge
bokeh                     2.0.0                    py36_0  
boost                     1.63.0           py36h415b752_1    rdkit
boost-cpp                 1.70.0               ha2d47e9_1    conda-forge
brotli                    1.0.7             he1b5a44_1001    conda-forge
bzip2                     1.0.8                h516909a_2    conda-forge
c-ares                    1.15.0            h516909a_1001    conda-forge
ca-certificates           2020.6.24                     0  
cachetools                4.0.0                    pypi_0    pypi
cairo                     1.16.0            h18b612c_1001    conda-forge
certifi                   2020.6.20                py36_0  
cffi                      1.13.2           py36h8022711_0    conda-forge
chardet                   3.0.4                 py36_1003    conda-forge
chemprop                  0.0.4                     dev_0    <develop>
click                     7.1.1              pyh8c360ce_0    conda-forge
cloudpickle               1.3.0                      py_0  
cryptography              2.8              py36h72c5cf5_1    conda-forge
cudatoolkit               10.1.243             h6bb024c_0  
cudnn                     7.6.5                cuda10.1_0  
cycler                    0.10.0                     py_2    conda-forge
cython                    0.29.15          py36he1b5a44_0    conda-forge
cytoolz                   0.10.1           py36h7b6447c_0  
dask                      2.12.0                     py_0  
dask-core                 2.12.0                     py_0  
dask-jobqueue             0.7.1                      py_0    conda-forge
dbus                      1.13.6               he372182_0    conda-forge
decorator                 4.4.1                      py_0  
deepchem                  2.3.1.dev18               dev_0    <develop>
defusedxml                0.6.0                      py_0  
descriptastorus           2.2.0.2                  pypi_0    pypi
dill                      0.3.1.1                  py36_0    conda-forge
distributed               2.14.0           py36h9f0ad1d_0    conda-forge
docutils                  0.16                     py36_0  
double-conversion         3.1.5                he1b5a44_2    conda-forge
entrypoints               0.3                      py36_0  
expat                     2.2.9                he1b5a44_2    conda-forge
fastparquet               0.3.3            py36hc1659b7_0    conda-forge
fftw3f                    3.3.4                         2    omnia
flaky                     3.6.1                      py_0    conda-forge
flask                     1.1.2                    pypi_0    pypi
fontconfig                2.13.1            he4413a7_1000    conda-forge
freetype                  2.10.0               he983fc9_1    conda-forge
fsspec                    0.6.2                      py_0  
future                    0.18.2                   pypi_0    pypi
gast                      0.2.2                      py_0    conda-forge
gettext                   0.19.8.1          hc5be6a0_1002    conda-forge
gflags                    2.2.2             he1b5a44_1002    conda-forge
glib                      2.58.3          py36h6f030ca_1002    conda-forge
glog                      0.4.0                he1b5a44_1    conda-forge
gmp                       6.1.2                h6c8ec71_1  
google-auth               1.11.2                     py_0    conda-forge
google-auth-oauthlib      0.4.1                      py_2    conda-forge
google-pasta              0.1.8                    pypi_0    pypi
grpc-cpp                  1.25.0               h18db393_0    conda-forge
grpcio                    1.27.2           py36hf8bcb03_0  
gst-plugins-base          1.14.5               h0935bb2_2    conda-forge
gstreamer                 1.14.5               h36ae1b5_2    conda-forge
h5py                      2.10.0          nompi_py36h513d04c_102    conda-forge
hdf5                      1.10.5          nompi_h3c11f04_1104    conda-forge
heapdict                  1.0.1                      py_0  
hyperopt                  0.2.4                    pypi_0    pypi
icu                       58.2              hf484d3e_1000    conda-forge
idna                      2.9                        py_1    conda-forge
imagesize                 1.2.0                      py_0  
importlib_metadata        1.5.0                    py36_0  
ipykernel                 5.1.4            py36h39e3cac_0  
ipython                   7.13.0           py36h5ca1d4c_0    conda-forge
ipython_genutils          0.2.0                    py36_0  
ipywidgets                7.5.1                      py_0  
itsdangerous              1.1.0                    pypi_0    pypi
jedi                      0.16.0                   py36_0  
jinja2                    2.11.1                     py_0  
joblib                    0.14.1                     py_0    conda-forge
jpeg                      9c                h14c3975_1001    conda-forge
jsonschema                3.2.0                    py36_0  
jupyter                   1.0.0                    py36_7  
jupyter_client            5.3.4                    py36_0  
jupyter_console           6.1.0                      py_0  
jupyter_core              4.6.1                    py36_0  
keras                     2.3.1                    py36_0    conda-forge
keras-applications        1.0.8                      py_1    conda-forge
keras-preprocessing       1.1.0                      py_0    conda-forge
kiwisolver                1.1.0            py36hc9558a2_0    conda-forge
ld_impl_linux-64          2.33.1               h53a641e_7  
libblas                   3.8.0               14_openblas    conda-forge
libcblas                  3.8.0               14_openblas    conda-forge
libedit                   3.1.20181209         hc058e9b_0  
libevent                  2.1.10               h72c5cf5_0    conda-forge
libffi                    3.2.1                hd88cf55_4  
libgcc                    7.2.0                h69d50b8_2  
libgcc-ng                 9.1.0                hdf63c60_0  
libgfortran-ng            7.3.0                hdf63c60_5    conda-forge
libgpuarray               0.7.6             h14c3975_1003    conda-forge
libiconv                  1.15              h516909a_1005    conda-forge
liblapack                 3.8.0               14_openblas    conda-forge
libllvm8                  8.0.1                hc9558a2_0    conda-forge
libopenblas               0.3.7                h5ec1e0e_6    conda-forge
libpng                    1.6.37               hed695b0_0    conda-forge
libprotobuf               3.8.0                h8b12597_0    conda-forge
libsodium                 1.0.16               h1bed415_0  
libstdcxx-ng              9.1.0                hdf63c60_0  
libtiff                   4.1.0                hc3755c2_3    conda-forge
libuuid                   2.32.1            h14c3975_1000    conda-forge
libxcb                    1.13              h14c3975_1002    conda-forge
libxgboost                0.90                 he1b5a44_4    conda-forge
libxml2                   2.9.9                hea5a465_1  
llvmlite                  0.31.0           py36hfa65bc7_1    conda-forge
locket                    0.2.0                    py36_1  
lz4-c                     1.8.3             he1b5a44_1001    conda-forge
lzo                       2.10              h14c3975_1000    conda-forge
mako                      1.1.0                      py_0    conda-forge
markdown                  3.2.1                      py_0    conda-forge
markupsafe                1.1.1            py36h7b6447c_0  
matplotlib                3.3.0                    pypi_0    pypi
mdtraj                    1.9.1                    py36_1    deepchem
mistune                   0.8.4            py36h7b6447c_0  
mock                      3.0.5                    py36_0    conda-forge
msgpack-python            0.6.1            py36hfd86e86_1  
mypy-extensions           0.4.3                    pypi_0    pypi
nb_conda                  2.2.1                    py36_2    conda-forge
nb_conda_kernels          2.2.2                    py36_0  
nbconvert                 5.6.1                    py36_0  
nbformat                  5.0.4                      py_0  
ncurses                   6.1                  he6710b0_1  
networkx                  2.4                        py_0    conda-forge
nose                      1.3.7                 py36_1003    conda-forge
nose-timer                0.7.4                      py_0    conda-forge
notebook                  6.0.3                    py36_0  
numba                     0.48.0           py36hb3f55d8_0    conda-forge
numexpr                   2.7.1            py36hb3f55d8_0    conda-forge
numpy                     1.18.1           py36h95a1406_0    conda-forge
numpydoc                  0.9.2                      py_0  
oauthlib                  3.1.0                    pypi_0    pypi
oddt                      0.7                        py_0    oddt
olefile                   0.46                       py_0    conda-forge
openmm                    7.4.1           py36_cuda101_rc_1    omnia
openssl                   1.1.1g               h7b6447c_0  
opt-einsum                3.1.0                    pypi_0    pypi
opt_einsum                3.2.0                      py_0    conda-forge
packaging                 20.3                       py_0  
pandarallel               1.4.8                    pypi_0    pypi
pandas                    1.0.5                    pypi_0    pypi
pandas-flavor             0.2.0                    pypi_0    pypi
pandoc                    2.2.3.2                       0  
pandocfilters             1.4.2                    py36_1  
parquet-cpp               1.5.1                         2    conda-forge
parso                     0.6.1                      py_0  
partd                     1.1.0                      py_0  
pbr                       5.4.2                      py_0    conda-forge
pcre                      8.44                 he1b5a44_0    conda-forge
pdbfixer                  1.6                      py36_0    omnia
pexpect                   4.8.0                    py36_0  
pickleshare               0.7.5                    py36_0  
pillow                    7.0.0            py36hb39fc2d_0  
pip                       20.0.2                   py36_1  
pixman                    0.38.0            h516909a_1003    conda-forge
prometheus_client         0.7.1                      py_0  
prompt_toolkit            3.0.3                      py_0  
protobuf                  3.11.3                   pypi_0    pypi
psutil                    5.7.0            py36h7b6447c_0  
pthread-stubs             0.4               h14c3975_1001    conda-forge
ptyprocess                0.6.0                    py36_0  
py-xgboost                0.90                     py36_4    conda-forge
pyarrow                   0.15.1           py36h8b68381_1    conda-forge
pyasn1                    0.4.8                      py_0    conda-forge
pyasn1-modules            0.2.8                    pypi_0    pypi
pycparser                 2.19                       py_2    conda-forge
pygments                  2.5.2                      py_0  
pygpu                     0.7.6           py36hc1659b7_1000    conda-forge
pyjwt                     1.7.1                      py_0    conda-forge
pyopenssl                 19.1.0                     py_1    conda-forge
pyparsing                 2.4.6                      py_0    conda-forge
pyqt                      5.9.2            py36hcca6a23_4    conda-forge
pyrsistent                0.15.7           py36h7b6447c_0  
pysocks                   1.7.1                    py36_0    conda-forge
pytables                  3.6.1            py36h9f153d1_1    conda-forge
python                    3.6.10               h0371630_0  
python-dateutil           2.8.1                      py_0  
python-snappy             0.5.4            py36he6710b0_0  
python_abi                3.6                     1_cp36m    conda-forge
pytz                      2019.3                     py_0    conda-forge
pyyaml                    5.3              py36h8c4c3a4_1    conda-forge
pyzmq                     18.1.1           py36he6710b0_0  
qt                        5.9.7                h52cfd70_2    conda-forge
qtconsole                 4.7.1                      py_0  
qtpy                      1.9.0                      py_0  
rdkit                     2017.09.1                py36_1    rdkit
re2                       2020.03.03           he1b5a44_0    conda-forge
readline                  7.0                  h7b6447c_5  
requests                  2.23.0                   py36_0    conda-forge
requests-oauthlib         1.3.0                    pypi_0    pypi
rsa                       4.0                        py_0    conda-forge
scikit-learn              0.22.2.post1     py36hcdab131_0    conda-forge
scipy                     1.4.1            py36h921218d_0    conda-forge
seaborn                   0.10.0                     py_0  
send2trash                1.5.0                    py36_0  
setuptools                45.2.0                   py36_0    conda-forge
simdna                    0.4.2                      py_0    deepchem
sip                       4.19.8           py36hf484d3e_0  
six                       1.14.0                   py36_0    conda-forge
snappy                    1.1.8                he1b5a44_1    conda-forge
snowballstemmer           2.0.0                      py_0  
sortedcontainers          2.1.0                    py36_0  
sphinx                    2.4.0                      py_0  
sphinxcontrib-applehelp   1.0.2                      py_0  
sphinxcontrib-devhelp     1.0.2                      py_0  
sphinxcontrib-htmlhelp    1.0.3                      py_0  
sphinxcontrib-jsmath      1.0.1                      py_0  
sphinxcontrib-qthelp      1.0.3                      py_0  
sphinxcontrib-serializinghtml 1.1.4                      py_0  
sqlite                    3.31.1               h7b6447c_0  
tblib                     1.6.0                      py_0  
tensorboard               1.14.0           py36hf484d3e_0  
tensorboardx              2.1                      pypi_0    pypi
tensorflow                1.14.0          mkl_py36h2526735_0  
tensorflow-base           1.14.0          mkl_py36h7ce6ba3_0  
tensorflow-estimator      1.14.0                     py_0  
tensorflow-gpu            1.14.0                   pypi_0    pypi
termcolor                 1.1.0                      py_2    conda-forge
terminado                 0.8.3                    py36_0  
testpath                  0.4.4                      py_0  
theano                    0.9.0                    py36_1    conda-forge
thrift                    0.11.0          py36he1b5a44_1001    conda-forge
thrift-cpp                0.12.0            hf3afdfd_1004    conda-forge
tk                        8.6.8                hbc83047_0  
toolz                     0.10.0                     py_0  
torch                     1.5.1                    pypi_0    pypi
tornado                   6.0.3            py36h7b6447c_3  
tqdm                      4.48.0                   pypi_0    pypi
traitlets                 4.3.3                    py36_0  
typed-argument-parser     1.5.2                    pypi_0    pypi
typing-inspect            0.6.0                    pypi_0    pypi
typing_extensions         3.7.4.1                  py36_0  
uriparser                 0.9.3                he1b5a44_1    conda-forge
urllib3                   1.25.7                   py36_0    conda-forge
wcwidth                   0.1.8                      py_0  
webencodings              0.5.1                    py36_1  
werkzeug                  1.0.0                      py_0    conda-forge
wheel                     0.34.2                   py36_0  
widgetsnbextension        3.5.1                    py36_0  
wrapt                     1.12.0                   pypi_0    pypi
xarray                    0.16.0                   pypi_0    pypi
xgboost                   0.90             py36he1b5a44_4    conda-forge
xorg-kbproto              1.0.7             h14c3975_1002    conda-forge
xorg-libice               1.0.10               h516909a_0    conda-forge
xorg-libsm                1.2.3             h84519dc_1000    conda-forge
xorg-libx11               1.6.9                h516909a_0    conda-forge
xorg-libxau               1.0.9                h14c3975_0    conda-forge
xorg-libxdmcp             1.1.3                h516909a_0    conda-forge
xorg-libxext              1.3.4                h516909a_0    conda-forge
xorg-libxrender           0.9.10            h516909a_1002    conda-forge
xorg-renderproto          0.11.1            h14c3975_1002    conda-forge
xorg-xextproto            7.3.0             h14c3975_1002    conda-forge
xorg-xproto               7.0.31            h14c3975_1007    conda-forge
xz                        5.2.4                h14c3975_4  
yaml                      0.2.2                h516909a_1    conda-forge
zeromq                    4.3.1                he6710b0_3  
zict                      2.0.0                      py_0  
zipp                      2.2.0                      py_0  
zlib                      1.2.11            h516909a_1006    conda-forge
zstd                      1.4.4                h3b9ef0a_1    conda-forge

can't find in pypi

I'm trying to install chemprop with pip install chemprop, but pip can't find the package.

memory leakage when inference

Hi,

I found a memory leak (RAM, not GPU memory) during inference.

As the number of inference calls increases, the amount of allocated memory also increases.

Thanks.
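
A report like this is easier to act on with numbers attached. The sketch below uses the standard-library `tracemalloc` module to record Python-level allocations after each call. The leaky function is a deliberately broken stand-in, not chemprop code, and `tracemalloc` does not see memory allocated by native extensions, so a leak inside a compiled library could go unnoticed.

```python
import tracemalloc


def traced_memory_per_call(func, repeats):
    """Call func repeatedly and record Python-level allocated bytes after each call."""
    tracemalloc.start()
    try:
        readings = []
        for _ in range(repeats):
            func()
            current, _peak = tracemalloc.get_traced_memory()
            readings.append(current)
        return readings
    finally:
        tracemalloc.stop()


# Deliberately leaky stand-in for a prediction call (hypothetical).
_cache = []


def leaky_predict():
    _cache.append([0.0] * 10_000)  # keeps a reference forever


readings = traced_memory_per_call(leaky_predict, repeats=5)
print(readings)
```

Readings that keep climbing across calls point to objects being retained between predictions.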

Hang when prediction data > available RAM

I've noticed that chemprop will hang once all available RAM is depleted, which happens if the prediction dataset is larger than the available RAM on the system.

For prediction purposes, is it necessary to load all of the test data into RAM at once? Can it be read as needed instead?
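
Reading as needed is straightforward at the file level. The sketch below shows a generic batching reader built on the standard library. It is not how chemprop loads data, and the function name is hypothetical.

```python
import csv
import io
from itertools import islice


def iter_batches(lines, batch_size):
    """Yield lists of CSV rows (as dicts) without loading the whole file."""
    reader = csv.DictReader(lines)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# An in-memory file stands in for a large prediction CSV on disk.
data = io.StringIO("smiles\nCCO\nCCN\nCCC\nc1ccccc1\nCC(=O)O\n")
sizes = [len(batch) for batch in iter_batches(data, batch_size=2)]
print(sizes)
```

Only one batch of rows is held in memory at a time, so peak usage depends on `batch_size` and not on the size of the file.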

How can I reproduce the results in the paper?

Thank you very much for this excellent work. How can I reproduce the results in the paper?

Interpreting: all smiles are failed for parsing

https://github.com/chemprop/chemprop#Interpreting
None of the SMILES on that page can be parsed.

[Feature Request] Multi-GPU Training

Would it be possible to implement a way to do multi-GPU training via PyTorch? The tutorials on the PyTorch website make it appear easy to do so. I think it would be useful for speeding up training with larger datasets.

DataParallel Docs

MPI support?

Hi, can chemprop use MPI to run in parallel? I tried it on my grid, but it ran the same task several times.

[Feature Request] Add --ignore_columns

I have some chemical names in my datasets. It would be far more convenient if I could simply ignore those columns rather than having to explicitly list every column that I want to include.

I imagine the implementation is extremely simple: if --ignore_columns name is supplied, then the targets are all columns (or --target_columns, if supplied) minus name.

If you agree to this feature, I can send a PR.
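
The rule proposed above can be sketched in a few lines. This is an illustration of the request, not an existing chemprop function. The function name is hypothetical, and excluding the SMILES column from the default candidates is an added assumption.

```python
def resolve_target_columns(header, smiles_column, target_columns=None, ignore_columns=None):
    """Pick target columns: the explicit list if given, else every non-SMILES
    column, minus anything listed in ignore_columns. Candidate order is preserved."""
    ignore = set(ignore_columns or [])
    candidates = target_columns if target_columns is not None else [
        name for name in header if name != smiles_column
    ]
    return [name for name in candidates if name not in ignore]


header = ["smiles", "name", "logP", "solubility"]
print(resolve_target_columns(header, "smiles", ignore_columns=["name"]))
```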

How to add Rdkit Features exactly?

If I want to add DFs RDKit features to my calculation, should I add "--features_generator rdkit_2d_normalized --no_features_scaling" at every step (hyperparameter optimization, training, prediction, and interpretation), or only when training the model?
Thank you~

ZINC15 Dataset

Hey guys!

Sorry to bother again, but do you happen to have a pre-processed version of the (complete) ZINC15 dataset you use in your Cell paper? Or, alternately, can you release the scripts which you used to prepare it? I can't find it in your released data archive.

Cheers!
Rich

Error with atom_messages = True and undirected = True

Hi,

I got an error when I passed these options:

python train.py --data_path <data-csv> --dataset_type regression --save_dir test2 --atom_messages --undirected

Traceback (most recent call last):
  File "train.py", line 11, in <module>
    cross_validate(args, logger)
  File "/home/jinserk/kyu/chemprop/chemprop/train/cross_validate.py", line 29, in cross_validate
    model_scores = run_training(args, logger)
  File "/home/jinserk/kyu/chemprop/chemprop/train/run_training.py", line 180, in run_training
    n_iter = train(
  File "/home/jinserk/kyu/chemprop/chemprop/train/train.py", line 71, in train
    preds = model(batch, features_batch)
  File "/home/jinserk/.pyenv/versions/3.8.2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jinserk/kyu/chemprop/chemprop/models/model.py", line 88, in forward
    output = self.ffn(self.encoder(*input))
  File "/home/jinserk/.pyenv/versions/3.8.2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/jinserk/kyu/chemprop/chemprop/models/mpn.py", line 187, in forward
    output = self.encoder.forward(batch, features_batch)
  File "/home/jinserk/kyu/chemprop/chemprop/models/mpn.py", line 102, in forward
    message = (message + message[b2revb]) / 2
RuntimeError: The size of tensor a (3616) must match the size of tensor b (8509) at non-singleton dimension 0

I'd like to know why this happens.
Thank you!
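A likely reading of the traceback: the failing line averages each message with the message of its reverse bond via `b2revb`, which has one entry per directed bond. With --atom_messages the message tensor presumably has one row per atom instead (3616 here) while `b2revb` still has one entry per bond (8509), so the two operands no longer line up. The pure-Python sketch below mimics that averaging step with scalars in place of hidden vectors; the function name and the explicit length check are assumptions added for illustration and are not part of Chemprop.

```python
from typing import List, Sequence


def average_with_reverse(message: Sequence[float], b2revb: Sequence[int]) -> List[float]:
    """Average each entry with the entry of its reverse bond.

    This only makes sense when `message` is indexed by directed bond,
    i.e. when it has exactly one entry per element of `b2revb`.
    """
    if len(message) != len(b2revb):
        raise ValueError(
            f"message has {len(message)} rows but b2revb has {len(b2revb)} entries; "
            "messages must be per-bond to be averaged with their reverse bond"
        )
    return [(message[i] + message[rev]) / 2 for i, rev in enumerate(b2revb)]


if __name__ == "__main__":
    # Two undirected bonds stored as four directed bonds: 0<->1 and 2<->3.
    b2revb = [1, 0, 3, 2]
    print(average_with_reverse([1.0, 3.0, 5.0, 7.0], b2revb))  # per-bond: fine

    try:
        # Three atoms but four directed bonds: the atom-message case.
        average_with_reverse([1.0, 2.0, 3.0], b2revb)
    except ValueError as err:
        print("mismatch:", err)
```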

ImportError

Has anyone encountered the following problem when importing the rdkit package?

ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version `CXXABI_1.3.11' not found (required by ......./anaconda3/envs/chemprop/lib/python3.6/site-packages/rdkit/DataStructs/../../../../libRDKitDataStructs.so.1)

I have gcc version 5.4.0 and rdkit version 2020.03.1

Thank you.

Does SMILES Canonicalization Affect Performance?

Sorry again to bother!

Quick question - should molecules be rendered into a "canonical" SMILES format before training, or does the model learn the features regardless? My intuition is that yes, they should all be canonicalized, and if you did this for your experiments, what method did you use to canonicalize?

Cheers!
R

[Question] SMILES validation

How is Chemprop checking or validating the input SMILES strings?

I'm getting warnings that 970 of my SMILES are invalid, but when I check my SMILES through RDKit using Chem.MolFromSmiles(), none of them come back as invalid.

Default number of epochs should depend on class balance

I believe that the class_balance option leads to epochs that sample $2\cdot \min(\mathrm{num\_pos}, \mathrm{num\_neg})$ data points per epoch, where $\mathrm{num\_pos}$ is the number of molecules with label 1 and $\mathrm{num\_neg}$ is the number with label 0. This means that the number of points sampled per epoch is only a fraction $\alpha = 2\cdot \min(\mathrm{num\_pos}, \mathrm{num\_neg}) / (\mathrm{num\_pos} + \mathrm{num\_neg})$ of the number sampled in the non-balanced case. I think it makes sense to scale the default number of epochs by $1/\alpha$ when using class_balance.
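As a worked example of the proposed scaling, the sketch below computes the fraction of the dataset seen per balanced epoch and the correspondingly scaled epoch count. The function names and the default of 30 epochs are illustrative assumptions, and the formula is the one conjectured in this issue rather than confirmed Chemprop behavior.

```python
def balanced_fraction(num_pos: int, num_neg: int) -> float:
    """Fraction alpha of the dataset sampled per epoch under class balancing."""
    if num_pos <= 0 or num_neg <= 0:
        raise ValueError("both classes need at least one example")
    return 2 * min(num_pos, num_neg) / (num_pos + num_neg)


def scaled_epochs(default_epochs: int, num_pos: int, num_neg: int) -> int:
    """Scale the epoch count by 1/alpha, rounding up.

    Integer ceiling division is used so the result does not depend on
    floating-point rounding.
    """
    if num_pos <= 0 or num_neg <= 0:
        raise ValueError("both classes need at least one example")
    numerator = default_epochs * (num_pos + num_neg)
    denominator = 2 * min(num_pos, num_neg)
    return -(-numerator // denominator)


if __name__ == "__main__":
    # 100 positives and 900 negatives: each balanced epoch sees 200 of 1000 points.
    print(balanced_fraction(100, 900))   # alpha = 0.2
    print(scaled_epochs(30, 100, 900))   # 30 / 0.2 = 150 epochs
```

With a perfectly balanced dataset the fraction is 1 and the epoch count is unchanged, which matches the non-balanced case.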

clarification in reproducing ESOL results?

Hello, I have been playing around with chemprop, and my particular goal is to reproduce the 0.555 ± 0.047 result that you report for ESOL. Currently I'm seeing numbers in the range of 0.72.

(as an aside, I've also tried playing around with main2.sh from https://github.com/yangkevin2/lsc_experiments. It's quite slow due to running so many experiments, and it produces a whole pile of results, of which I'm unsure which, if any, directly correspond to the reported 0.555 ± 0.047 result.)

Based on the general instructions in the "Results" section, I believe that I should first optimize hyperparameters, with a command like

 python hyperparameter_optimization.py --data_path data/esol.csv --dataset_type regression --num_iters 2 --config_save_path esol-hyper-config.json --quiet --split_type random

However, I immediately have multiple questions:

what is the correct value for num_iters?
what is the correct split_type for this dataset? Presumably either predetermined or index_predetermined
depending on the split type, what are the correct folds_file and test_fold_index (for predetermined), or the correct crossval_index_file (for index_predetermined)?
should I be adding --features_generator rdkit_2d_normalized --no_features_scaling for the ESOL dataset?

I believe the second and final step is to then run a command along the lines of

python train.py --data_path data/esol.csv --dataset_type regression --save_dir saves/esol --split_type scaffold_balanced --config_path esol-hyper-config.json

All the same questions about splits and folds apply here. Should the settings be the same as in the hyperparameter step or different?
Additionally, should I add --num_folds 10 and --seed 3? These seem to be the standard in the lsc_experiments repo.

And as a final question, is this all otherwise correct?

Thank you!

Sklearn_train.py does not save train/test splits

I've tried using the sklearn_train.py script and it works, but it does not seem to save the train/test splits, even when using the --save_smiles_splits flag.

Document Machine Requirements / Runtime

Hello, all! Fantastic work!

In your papers and GitHub documentation, I haven't found anything about the RAM/GPU/time requirements for reproducing your results. Can this be done in a reasonable time on a normal workstation, or does it require an HPC cluster?

Cheers!
Rich
