
chemformer's People

Contributors

anniewesterlund, ebjerrum, salman-moh, sgenheden, spirosdim, wwydmanski

chemformer's Issues

Reproducing Claimed Results for Single-Step Retrosynthesis

I am having trouble reproducing the results of the paper, and I wanted to know whether I am doing anything wrong. I evaluated all the fine-tuned models provided (https://az.app.box.com/s/7eci3nd9vy0xplqniitpk02rbg9q2zcq/folder/145316934583) on the USPTO-50k dataset, and the highest accuracy I obtained was with the USPTO-SEP model.

Could you please elaborate on how the results were obtained and possibly share the list of reactants it was tested on? I could have somehow chosen a problematic test split that led to poor results.
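For reference, a minimal sketch of how such an evaluation could be scored, assuming a predictions CSV with a target column and prediction_0 ... prediction_9 columns; the column names are assumptions to adapt to the actual output of the predict script:

# Hedged sketch: top-k accuracy via canonical-SMILES comparison.
# Column names ("target", "prediction_0", ...) are assumptions, not the repo's API.
import pandas as pd
from rdkit import Chem

def canonical(smiles):
    if not isinstance(smiles, str):
        return None
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def top_k_accuracy(path, k=10):
    df = pd.read_csv(path)
    hits = 0
    for _, row in df.iterrows():
        target = canonical(row["target"])
        preds = [canonical(row[f"prediction_{i}"]) for i in range(k)]
        hits += int(target is not None and target in preds)
    return hits / len(df)

print(top_k_accuracy("uspto50k_predictions.csv", k=1))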

models.py file is not in the repository?

Apologies if that's a dumb question: I wanted to look into the models.py file mentioned in the README but wasn't able to find it in this repository. Could you please help me with this? The README says the following: "Models: The models.py file contains a Pytorch Lightning implementation of the BART language model, as well as Pytorch Lightning implementations of models for downstream tasks.", but I couldn't locate the models.py file.

Question on sampling desirable molecules

Hi!
In the experiment on generating desirable molecules, Chemformer used beam search to generate output molecules, while the Transformer and Transformer-R baselines used greedy search. Why is the sampling different? Or would each sampling method produce the same result?
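For what it is worth, a toy illustration (not Chemformer's actual sampler) of why greedy search and beam search need not return the same output:

# Toy example: greedy vs. beam search over a hand-written next-token
# distribution. It only illustrates that the two strategies can disagree;
# it is not Chemformer's decoder.
import math

STEP_LOGPROBS = {
    (): {"A": math.log(0.6), "B": math.log(0.4)},
    ("A",): {"x": math.log(0.55), "y": math.log(0.45)},
    ("B",): {"x": math.log(0.9), "y": math.log(0.1)},
}

def greedy(steps=2):
    seq, score = (), 0.0
    for _ in range(steps):
        tok, lp = max(STEP_LOGPROBS[seq].items(), key=lambda kv: kv[1])
        seq, score = seq + (tok,), score + lp
    return seq, score

def beam(width=2, steps=2):
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = [
            (seq + (tok,), score + lp)
            for seq, score in beams
            for tok, lp in STEP_LOGPROBS[seq].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0]

print(greedy())  # ('A', 'x'), log(0.6 * 0.55)
print(beam())    # ('B', 'x'), log(0.4 * 0.9) -- a higher-probability sequence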

SMILES to vector?

Can this model convert a SMILES string into a vector?
If so, how can I do it?

Best regards
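A rough sketch of one way to do this in principle: tokenize and pad the SMILES, run the encoder, then mean-pool the memory over the non-padding positions to get one fixed-size vector per molecule. The character tokenizer and nn.TransformerEncoder below are stand-ins, not Chemformer's own tokenizer or BART encoder, so treat all names and shapes as assumptions:

# Sketch only: pooling an encoder's memory tensor into one vector per SMILES.
# The tokenizer and encoder here are stand-ins, not Chemformer's classes.
import torch
import torch.nn as nn

smiles = ["CCO", "c1ccccc1O"]
vocab = {ch: i + 1 for i, ch in enumerate(sorted(set("".join(smiles))))}  # 0 = pad

max_len = max(len(s) for s in smiles)
ids = torch.zeros(len(smiles), max_len, dtype=torch.long)
for row, s in enumerate(smiles):
    ids[row, : len(s)] = torch.tensor([vocab[ch] for ch in s])
pad_mask = ids == 0  # True where padded

d_model = 64
embed = nn.Embedding(len(vocab) + 1, d_model, padding_idx=0)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)

memory = encoder(embed(ids), src_key_padding_mask=pad_mask)  # (batch, seq, d_model)

# Mean-pool only over real (non-pad) positions to get one vector per molecule.
keep = (~pad_mask).unsqueeze(-1).float()
vectors = (memory * keep).sum(dim=1) / keep.sum(dim=1)
print(vectors.shape)  # torch.Size([2, 64])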

install

I ran into several issues with the described setup.
First, CUDA 11.1 did not work with the fairscale version described; I resolved that by using fairscale 0.4.1. I also dropped optuna for now because its installation seems to downgrade PyTorch. Just wanted to leave this here in case others run into similar issues. It seems to run now.

Questions on SMILES string canonicalization at inference

Hi there, thanks for your great work!

I found that in predict.py you use the input SMILES string directly, rather than a canonicalized one, for inference. Is this intended? If so, could two different SMILES strings for the same molecule lead to different predictions?

Best regards!
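In case it is useful, canonicalizing the inputs yourself before calling the model is cheap with RDKit; a minimal sketch (independent of predict.py itself):

# Sketch: canonicalize SMILES before feeding them to the model, so two
# different strings for the same molecule map to the same input.
from rdkit import Chem

def canonicalize(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return Chem.MolToSmiles(mol)

print(canonicalize("OCC"))    # CCO
print(canonicalize("C(O)C"))  # CCO -- same canonical form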

Small setup.py change

Hi,
I failed to install Chemformer as a module, and it seems to be because setup.py is missing packages=find_packages(exclude=[]). Could you maybe add that? It would ease our setup considerably. Thank you so much already!
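For concreteness, a sketch of the kind of setup.py meant here (not the repository's actual file; the name and version are placeholders):

# Sketch of the suggested change: let setuptools discover the molbart packages.
from setuptools import find_packages, setup

setup(
    name="molbart",    # placeholder metadata
    version="0.0.0",
    packages=find_packages(exclude=[]),
)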

SMILES to memory tensor

I've been working on using the encoder of a pre-trained transformer model to extract the 2D memory tensor representation of a SMILES molecule. After some initial challenges, I've successfully configured the model to generate these memory tensors.

It appears that the encoder expects an input vector where each element is the ID of the corresponding token in the SMILES sequence. To accommodate variations in SMILES length, the input vector is padded to a fixed length of 512 using the ID of the padding character (i.e., 0).

However, I'm still unsure about the specific role of the padding mask vector. Does True mark the padded positions, or the positions where the actual SMILES sequence resides?
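For reference, PyTorch's own transformer layers use the convention that True in the key padding mask marks positions to be ignored (i.e., the padded positions); whether Chemformer builds its mask the same way is worth confirming in the code, but a small stand-alone check looks like this:

# Sketch: the src_key_padding_mask convention in torch.nn.TransformerEncoder.
# True marks padded positions that attention should ignore. Check the
# Chemformer code to confirm it builds its mask the same way.
import torch
import torch.nn as nn

seq_len, d_model = 6, 32
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=1)

tokens = torch.randn(1, seq_len, d_model)
# First 4 positions are real tokens, the last 2 are padding.
pad_mask = torch.tensor([[False, False, False, False, True, True]])

memory = encoder(tokens, src_key_padding_mask=pad_mask)
print(memory.shape)  # torch.Size([1, 6, 32])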

Issues finding the correct dataset

I have been unable to find some of the datasets that appear to be essential for running the code, such as the "uspto_50.txt" file. I converted uspto_50.pickle to text in a tabulated format and changed the "reactants_mol" and "products_mol" columns to "reactants" and "products", respectively, but then the program could not find a sequence length column: KeyError: 'This dataset does not store any sequence lengths'.

Similarly, while running predict.sh, I couldn't find the "bart_vocab_downstream.txt" file; only a "bart_vocab_downstream.json" file is available.

Could you please point me to the correct files: uspto_50.txt and bart_vocab_downstream.txt?

Using pretrained ChemFormer to get encoder and decoder outputs

Hello,
I would like to get the embeddings (the output from the BART encoder) and separately use them as input to the decoder to get back the SMILES string, as presented in step 1 of Figure 1 in the paper. I don't see a direct way to do this; could you please suggest any methods?
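A very rough sketch of the wiring being asked about, using generic PyTorch modules as placeholders for Chemformer's BART encoder and decoder (so all names here are assumptions): encode once to get the memory, then feed that memory to the decoder alongside the tokens generated so far.

# Sketch only: encoder output ("memory") reused as the decoder's cross-attention
# input. Generic nn.Transformer modules stand in for Chemformer's BART model.
import torch
import torch.nn as nn

d_model, vocab_size = 64, 40
embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)

src_ids = torch.randint(0, vocab_size, (1, 12))  # encoded SMILES tokens
memory = encoder(embed(src_ids))                 # step 1: the embedding/memory

tgt_ids = torch.randint(0, vocab_size, (1, 5))   # tokens decoded so far
out = decoder(embed(tgt_ids), memory)            # step 2: decode against the memory
print(out.shape)  # torch.Size([1, 5, 64])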

Necessary GPUs for Molecular Property Prediction

Hi All,
Hope that you are okay. Just to provide a bit of background: I have been trying to run the finetuneRegr.py script from the conda command line in order to fine-tune the model and predict the conductivity of some compounds. I formatted my data file according to the steps detailed in #13 and specified the paths to (1) my data, (2) one of the models found in the link, and (3) the vocabulary text file with the added token.

After jumping through some hurdles (mostly changing some lines to account for the updated packages), I am now stuck with an error saying that the program needs [0] GPUs to run and I have: []. Could you please confirm that I'm following the right procedure for the task at hand? Are you aware of any way to get around this issue (such as changing the code or using a cloud GPU service)?

Kind regards,
Diego
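One generic thing worth ruling out first (not specific to Chemformer): whether PyTorch can see any CUDA device at all in the environment where finetuneRegr.py runs; an empty GPU list usually means it cannot.

# Quick sanity check: does this Python environment see any CUDA devices?
import torch

print(torch.cuda.is_available())  # False -> the script will find [] GPUs
print(torch.cuda.device_count())  # 0 on a CPU-only machine
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))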

os.sched_getaffinity error

While running molbart.predict on macOS, I received the error `AttributeError: module 'os' has no attribute 'sched_getaffinity'` in molbart/modules/data/base.py. Similarly, I received the same error in molbart/models/chemformer.py. Currently I handle the issue with:

# In molbart/modules/data/base.py
try:
    self._num_workers = len(os.sched_getaffinity(0))
except AttributeError:
    # os.sched_getaffinity is not available on macOS
    self._num_workers = os.cpu_count() or 1

# In molbart/models/chemformer.py
try:
    n_cpus = len(os.sched_getaffinity(0))
except AttributeError:
    n_cpus = os.cpu_count() or 1  # Default to 1 if cpu_count() is None

I am wondering whether this serves the original purpose of sched_getaffinity(0), and whether there is a better way to handle it on macOS?

KeyError: 'LOCAL_RANK'

Hello,

When attempting to fine-tune in a multi-gpu SLURM environment, I get the following error:

Mon Sep 18 09:20:06 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB           On  | 00000000:1A:00.0 Off |                    0 |
| N/A   31C    P0              44W / 300W |      0MiB / 32768MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB           On  | 00000000:1C:00.0 Off |                    0 |
| N/A   27C    P0              41W / 300W |      0MiB / 32768MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB           On  | 00000000:1D:00.0 Off |                    0 |
| N/A   28C    P0              42W / 300W |      0MiB / 32768MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB           On  | 00000000:1E:00.0 Off |                    0 |
| N/A   30C    P0              42W / 300W |      0MiB / 32768MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
MASTER_ADDR=mg089
GpuFreq=control_disabled
INFO:    underlay of /etc/localtime required more than 50 (80) bind mounts
INFO:    underlay of /etc/localtime required more than 50 (80) bind mounts
INFO:    underlay of /etc/localtime required more than 50 (80) bind mounts
INFO:    underlay of /etc/localtime required more than 50 (80) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (251) bind mounts
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
Multi-processing is handled by Slurm.
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 1, MEMBER: 2/4
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 3, MEMBER: 4/4
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 2, MEMBER: 3/4
/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py:50: UserWarning: you defined a validation_step but have no val_dataloader. Skipping validation loop
  warnings.warn(*args, **kwargs)
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/4
You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
INFO:lightning:You have not specified an optimizer or scheduler within the DeepSpeed config.Using `configure_optimizers` to define optimizer and scheduler.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/ui/abv/hoangsx/Chemformer/temp/train_boring.py", line 77, in <module>
    run()
  File "/ui/abv/hoangsx/Chemformer/temp/train_boring.py", line 72, in run
    trainer.fit(model, train_data)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 511, in fit
    self.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 540, in pre_dispatch
    self.accelerator.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 84, in pre_dispatch
    self.training_type_plugin.pre_dispatch()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 174, in pre_dispatch
    self.init_deepspeed()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 193, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 218, in _initialize_deepspeed_train
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 112, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 145, in __init__
    self._configure_with_arguments(args, mpu)
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 499, in _configure_with_arguments
    args.local_rank = int(os.environ['LOCAL_RANK'])
  File "/opt/conda/lib/python3.8/os.py", line 675, in __getitem__
    raise KeyError(key) from None
KeyError: 'LOCAL_RANK'
N GPUS: 4
N NODES: 1

(The same traceback ending in KeyError: 'LOCAL_RANK' is repeated for each of the other three ranks.)

My SLURM job request is as follows:

#!/bin/bash
#SBATCH --job-name=chemformer
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH -p gpu
#SBATCH --gres=gpu:4
#SBATCH --mem=64gb
#SBATCH --time=0:15:00
#SBATCH --output=slurm_out/output.%A_%a.txt

module load miniconda3
module load cuda/11.8.0
nvidia-smi

export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "WORLD_SIZE="$WORLD_SIZE

master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "MASTER_ADDR="$MASTER_ADDR

CONTAINER="singularity exec --nv $PROJ_DIR/singularity_images/chemformer.sif"

CMD="python -m molbart.fine_tune \
  --dataset uspto_50 \
  --data_path data/seq-to-seq_datasets/uspto_50.pickle \
  --model_path models/pre-trained/combined/step=1000000.ckpt \
  --task backward_prediction \
  --epochs 100 \
  --lr 0.001 \
  --schedule cycle \
  --batch_size 128 \
  --acc_batches 4 \
  --augment all \
  --aug_prob 0.5 \
  --gpus $SLURM_NTASKS_PER_NODE
  "

srun ${CONTAINER} ${CMD}

I am using the original library versions defined in the documentation. Is this behavior expected?
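A workaround that is sometimes used in this situation, sketched here as an assumption rather than an official fix: DeepSpeed reads LOCAL_RANK from the environment, and under srun the per-node task id is available as SLURM_LOCALID, so the variable can be populated before the trainer starts.

# Hedged workaround sketch: set LOCAL_RANK from SLURM's per-node task id before
# DeepSpeed reads it. Place this at the top of the training entry point.
import os

if "LOCAL_RANK" not in os.environ and "SLURM_LOCALID" in os.environ:
    os.environ["LOCAL_RANK"] = os.environ["SLURM_LOCALID"]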

Issue replicating results

Hello, I've been looking to run predictions through your fine-tuned USPTO-50K model, and I'm getting 0% accuracy. Even though the generated molecules are similar, not a single one is exactly correct. Here are the test SMILES I gave the model as input. Here's an example of the output I'm getting:

python -m molbart.predict \
        --reactants_path {input_file} \
        --products_path {output_file} \
        --model_path {model_file} \
        --batch_size 64 \
        --n_beams 10
source       |  COC(=O)c1ccc2c(c1)n(CCC(C)C)c(=O)n2CCC(C)C
prediction   |  CC(C)CCBr.COC(=O)c1ccc2[nH]c(=O)n(CCC(C)C)c2c1
target       |  CC(C)CCn1c(=O)n(CCC(C)C)c2cc(C(=O)O)ccc21 

The sha1sum of the checkpoint I downloaded is c859c68b198ac1b8cfab48196bdde6b35641bf81. However, I had to adjust the model to fit the Chemformer 2.0 code. I ran the following code to convert the model before using it:

import torch

# Rename 'vocab_size' to the 'vocabulary_size' key expected by Chemformer 2.0.
chemUSPTO50 = torch.load('Chemformer USPTO-50k.ckpt')
chemUSPTO50['hyper_parameters']['vocabulary_size'] = chemUSPTO50['hyper_parameters'].pop('vocab_size')
torch.save(chemUSPTO50, 'Chemformer2 USPTO-50k.ckpt')

If you know why this is happening, or if you are using a different model for the results below, please let me know. Thank you.

[Screenshot of the reported results referred to above]

Interpreting Chemformer Predictions

This is the output on the USPTO-15K dataset; I have a few questions about it (I'm new to the pharma domain):

  1. It looks like this is a retrosynthesis prediction task?
  2. But what's with the question marks in the predictions?
  3. If it is indeed retrosynthesis prediction, why are there exactly 10 predictions for every molecule? I would think some molecules from the USPTO-15K dataset could be broken down into fewer than 10 simpler molecules, rather than always exactly 10.

The columns of the output file:

| original_smiles | prediction_0 | log_likelihood_0 | prediction_1 | log_likelihood_1 | prediction_2 | log_likelihood_2 | prediction_3 | log_likelihood_3 | prediction_4 | log_likelihood_4 | prediction_5 | log_likelihood_5 | prediction_6 | log_likelihood_6 | prediction_7 | log_likelihood_7 | prediction_8 | log_likelihood_8 | prediction_9 | log_likelihood_9


DeadLock with multiple gpus

When I pretrained the model from scratch, the training process hung during the first epoch, with GPU usage stuck at 100%. The environment I used is as follows:

torch=1.8.0
pytorch_lightning=1.2.3
deepspeed=0.3.12

Has anyone else encountered the same problem, and can you give me some advice on how to tackle it? Thank you!

Data preprocess

Hi,
I want to know how to get uspto_50.pickle from the original uspto_50k.csv.
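While waiting for an answer, here is a hedged sketch of the kind of conversion involved, assuming uspto_50k.csv has reactant and product SMILES columns; the column names and any extra fields the data module expects (for example a train/val/test split column) are assumptions to verify against the repository:

# Hedged sketch: build a pickle with RDKit mol columns from a reaction CSV.
# Column names ("reactants", "products", "set") are assumptions; the real
# uspto_50.pickle may store additional fields expected by the data module.
import pandas as pd
from rdkit import Chem

df = pd.read_csv("uspto_50k.csv")
df["reactants_mol"] = df["reactants"].apply(Chem.MolFromSmiles)
df["products_mol"] = df["products"].apply(Chem.MolFromSmiles)
df["set"] = df.get("set", "train")  # train/val/test split column, if missing

df.to_pickle("uspto_50.pickle")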

Question USPTO50k

Hi! I just wanted to ask if the retrosynthesis experiments include the reagents or not.
Thank you already!

Inference errors with uploaded model checkpoints

I'm looking to run some predictions using the pretrained models you've uploaded (specifically models/fine-tuned/uspto-mixed/last.ckpt), and I've run into issues where pytorch lightning doesn't detect any hyperparams in the checkpoint model. I realize you've uploaded the hyperparams separately, but I'm not sure how to combine them into the model so that pytorch-lightning can load it correctly. Here is the exact error message I get:

>> $ python -m molbart.predict --reactants_path tests/test.csv  --model_path uspto_mixed/last.ckpt --task forward_prediction 

File "/baldig/chemistry/2023_rp/Chemformer/molbart/models/chemformer.py", line 387, in _initialize_from_ckpt
    self.model_path, decode_sampler=self.sampler
File "/home/mcangel/anaconda3/envs/chemformer/lib/python3.7/site-packages/pytorch_lightning/core/saving.py", line 156, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
File "/home/mcangel/anaconda3/envs/chemformer/lib/python3.7/site-packages/pytorch_lightning/core/saving.py", line 198, in _load_model_state
    model = cls(**_cls_kwargs)
TypeError: __init__() missing 1 required positional argument: 'vocabulary_size'

If you have any ideas, please let me know!
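In case it helps with debugging, a hedged sketch of patching the missing hyperparameters into the checkpoint, along the lines of the key-renaming trick mentioned in another issue; the exact names and values Chemformer expects must come from the hyperparameter file uploaded alongside the model, so the ones below are placeholders:

# Hedged sketch: inject missing hyperparameters into a checkpoint so that
# load_from_checkpoint can find them. Use the real values from the uploaded
# hyperparameter file; the ones below are placeholders.
import torch

ckpt = torch.load("uspto_mixed/last.ckpt", map_location="cpu")
hparams = ckpt.setdefault("hyper_parameters", {})
hparams.setdefault("vocabulary_size", 576)  # placeholder value
torch.save(ckpt, "uspto_mixed/last_patched.ckpt")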

Conda env dependencies not found

Dear Chemformer Team,

I cloned your repo and tried to set up the conda env with your env.yaml:

conda env create -f env.yaml

However, that simply prints a massive list of packages that cannot be found and aborts. Is there an updated / more lenient version of the env.yaml, or some container you could offer?

Thanks a lot - Tobi

Collecting package metadata (repodata.json): done
Solving environment: failed

ResolvePackageNotFound:
  - libgomp==12.2.0=h65d4601_19
  - pillow==9.2.0=py37h850a105_2
  - pixman==0.40.0=h36c2ea0_0
  - fontconfig==2.14.2=h14ed4e7_0
  - mkl==2021.4.0=h06a4308_640
  - libstdcxx-ng==12.2.0=h46fd767_19
  - libuuid==2.32.1=h7f98852_1000
  - pcre==8.45=h9c3ff4c_0
  - intel-openmp==2021.4.0=h06a4308_3561
  - xorg-libice==1.0.10=h7f98852_0
  - numpy==1.21.5=py37h6c91a56_3
  - libopenblas==0.3.21=pthreads_h78a6416_3
  - _libgcc_mutex==0.1=conda_forge
  - sqlite==3.40.0=h4ff8645_0
  - python==3.7.11=h12debd9_0
  - xorg-libxdmcp==1.1.3=h7f98852_0
  - readline==8.2=h8228510_1
  - libglib==2.68.4=h3e27bee_0
  - libpng==1.6.39=h753d276_0
  - libwebp-base==1.3.0=h0b41bf4_0
  - lcms2==2.14=h6ed2654_0
  - brotlipy==0.7.0=py37h27cfd23_1003
  - boost-cpp==1.74.0=h312852a_4
  - icu==68.2=h9c3ff4c_0
  - certifi==2022.12.7=py37h06a4308_0
  - libdeflate==1.14=h166bdaf_0
  - xorg-renderproto==0.11.1=h7f98852_1002
  - openjpeg==2.5.0=h7d73246_1
  - libuv==1.44.2=h5eee18b_0
  - xorg-kbproto==1.0.7=h7f98852_1002
  - cairo==1.16.0=h6cf1ce9_1008
  - expat==2.5.0=h27087fc_0
  - pyopenssl==23.0.0=py37h06a4308_0
  - xorg-xextproto==7.3.0=h0b41bf4_1003
  - pysocks==1.7.1=py37_1
  - freetype==2.12.1=hca18f0e_1
  - urllib3==1.26.14=py37h06a4308_0
  - lerc==4.0.0=h27087fc_0
  - libgfortran5==12.2.0=h337968e_19
  - libffi==3.3=h58526e2_2
  - libsqlite==3.40.0=h753d276_0
  - tk==8.6.12=h27826a3_0
  - cffi==1.15.1=py37h74dc2b5_0
  - xorg-libxrender==0.9.10=h7f98852_1003
  - numpy-base==1.21.5=py37ha15fc14_3
  - bzip2==1.0.8=h7f98852_4
  - libiconv==1.17=h166bdaf_0
  - libzlib==1.2.13=h166bdaf_4
  - ncurses==6.3=h27087fc_1
  - ca-certificates==2023.05.30=h06a4308_0
  - mkl-service==2.4.0=py37h7f8727e_0
  - cudatoolkit==11.1.74=h6bb024c_0
  - xorg-libx11==1.8.4=h0b41bf4_0
  - libxcb==1.13=h7f98852_1004
  - xorg-libxext==1.3.4=h0b41bf4_2
  - boost==1.74.0=py37h796e4cb_5
  - libgcc-ng==12.2.0=h65d4601_19
  - zlib==1.2.13=h166bdaf_4
  - mkl_fft==1.3.1=py37hd3c417c_0
  - xorg-libxau==1.0.9=h7f98852_0
  - setuptools==59.8.0=py37h89c1867_1
  - cryptography==39.0.1=py37h9ce1e76_0
  - _openmp_mutex==4.5=2_gnu
  - openssl==1.1.1u=h7f8727e_0
  - xz==5.2.6=h166bdaf_0
  - ninja==1.10.2=h06a4308_5
  - pandas==1.3.5=py37he8f5f7f_0
  - rdkit==2020.09.1=py37h0c252aa_0
  - zstd==1.5.2=h3eb15da_6
  - jpeg==9e=h0b41bf4_3
  - ld_impl_linux-64==2.40=h41732ed_0
  - gettext==0.21.1=h27087fc_0
  - mkl_random==1.2.2=py37h51133e4_0
  - libgfortran-ng==12.2.0=h69a702a_19
  - idna==3.4=py37h06a4308_0
  - pycairo==1.21.0=py37h0afab05_1
  - pthread-stubs==0.4=h36c2ea0_1001
  - requests==2.28.1=py37h06a4308_0
  - xorg-libsm==1.2.3=hd9c2040_1000
  - libgcc==7.2.0=h69d50b8_2
  - ninja-base==1.10.2=hd09550d_5
  - xorg-xproto==7.0.31=h7f98852_1007
  - libtiff==4.4.0=h82bc61c_5

Trying out inference

Hi @EBjerrum, I followed the steps to create a new venv and also downloaded the data/models from the link hosted by AZ.
I'm assuming I can run predict.py and obtain results by pointing the model path and dataset path args at those files.
However, I have a problem understanding what the args mean.
--reactants_path : Is this the pickled USPTO_50 dataset, for example? The code, I think, expects a text file (see the sketch below).
--model_path : here I use the path to the downloaded pre-trained 'combined' model.
--products_path : the path where the output dataframe is saved.
--vocab_path : the path to bart_vocab.
For the rest, I'm just using the default values.
I would like to get this working to put up a dashboard on HF Space :) Thanks!
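If --reactants_path does expect a plain text file, here is a small sketch of producing one (one SMILES per line) from the pickled dataset; the reactants_mol column name follows another issue in this list and is otherwise an assumption:

# Sketch: write one reactant SMILES per line for --reactants_path, assuming the
# pickle stores RDKit mols in a "reactants_mol" column (adjust as needed).
import pandas as pd
from rdkit import Chem

df = pd.read_pickle("uspto_50.pickle")
with open("reactants.txt", "w") as handle:
    for mol in df["reactants_mol"]:
        handle.write(Chem.MolToSmiles(mol) + "\n")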

Question on representation

Hi there,
I found that you are using the second item in the sequence for the sentence-level representation, as shown here. I'm wondering why you don't take the first token (the CLS token)?

AqSolDB - dataset for solubility, ignored intentionally?

Hi,
I'm considering the same regression fine-tuning of 3 tasks simultaneously, but adding the other solubility datasets from here as well. Before I go ahead with it, I'm curious whether any thought was given to this, or whether there were disadvantages to it, maybe because the dataset isn't completely reliable?

Thanks

Regression Task with Biological Activity Data Using Pretrained Chemformer

Dear Chemformer Team,

I am currently embarking on a project aiming to perform regression analysis using biological activity data (specifically, pXC50 values) with the pretrained Chemformer model. The objective is to predict activity values based on SMILES strings.

In the process of setting up my environment and preparing for fine-tuning, I encountered a closed issue #13
and a fork of the repository, which provided clear examples and scripts for fine-tuning Chemformer on regression tasks. Notably, these resources referenced RegPropDataModule(_AbsDataModule) in finetune_regression_modules.py, suggesting it as a viable option for regression with Chemformer.

However, upon revisiting the Chemformer repository, it appears that the finetune_regression directory and RegPropDataModule class are no longer present in the example_scripts folder, which has left me uncertain about the best approach to undertake my regression task with the latest codebase.

With the above context, I am reaching out to seek your guidance on several points:

  • Current Recommended DataModule: Given the removal of RegPropDataModule and associated fine-tuning examples, could you advise on which DataModule in the current code structure is best suited for handling a dataset of SMILES strings with pXC50 values for regression analysis?

  • Script Selection: Among the scripts present in the repository (e.g., fine_tune.py, inference_score.py, predict.py), which would you recommend for fine-tuning the pretrained model on a regression dataset and for making subsequent predictions?

  • Further Recommendations: If there are any specific recommendations regarding data preprocessing, hyperparameter selection, or other considerations to optimize the use of Chemformer for this regression task, I would be grateful for your insights.

Thank you very much for your time and support!

Molecule Property prediction task

Since we don't have access to the weights of the fine-tuned model for the molecule property prediction task, how do we fine-tune for this task using the Chemformer codebase?

I see that for fine-tuning we use: python -m molbart.example_scripts.finetune_regression.finetuneRegr.py <args>

Below are just the default args, but I'd like to know what the data_path file should look like for a molecule property prediction task.

DEFAULT_data_path = 'all_TrainTestVal_pXC50.csv'
DEFAULT_vocab_path = "prop_bart_vocab.txt"
DEFAULT_study_name = "TL_Regr_pXC50_" + str(datetime.datetime.now())
DEFAULT_model_path = "pt_zinc_bart.ckpt"
DEFAULT_genes_path = "/individual_files"
DEFAULT_BATCH_SIZE = 32
DEFAULT_ACC_BATCHES = 1
DEFAULT_GRAD_CLIP = 1.0
DEFAULT_SCHEDULE = "cycle"
DEFAULT_AUGMENT = True
DEFAULT_WARM_UP_STEPS = 3000
DEFAULT_TRAIN_TOKENS = None
DEFAULT_NUM_BUCKETS = 24
DEFAULT_LIMIT_VAL_BATCHES = 1.0
DEFAULT_EPOCHS = 150
DEFAULT_GPUS = 1
DEFAULT_D_PREMODEL = 512
DEFAULT_MAX_SEQ_LEN = 300
DEFAULT_LR = 3e-4
DEFAULT_H_FEEDFORWARD = 2048
DEFAULT_drp = 0.2
DEFAULT_Hdrp = 0.4
DEFAULT_WEIGHT_DECAY = 0.0

I have the 3 datasets with me: FreeSolv, Lipophilicity and ESOL.
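For what it is worth, based on the default all_TrainTestVal_pXC50.csv name I would expect the data_path file to be a simple table with a SMILES column, a target-value column, and a train/val/test split column; the sketch below uses guessed column names that should be checked against the finetune_regression example scripts:

# Hedged sketch of a minimal regression dataset file. Column names are guesses
# and must be matched to what finetuneRegr.py / the data module actually read.
import pandas as pd

df = pd.DataFrame(
    {
        "SMILES": ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"],
        "pXC50": [4.2, 5.1, 6.3],
        "SET": ["train", "valid", "test"],
    }
)
df.to_csv("all_TrainTestVal_pXC50.csv", index=False)
print(df)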


Thanks in advance!

published article

Not an issue with the code, but just letting you know that the PDF link for the accepted Chemformer version in Mach Learn: Sci Technol doesn't actually link to your article. It has your title on the front page, but the rest of the PDF is an article called "Physics makes the difference: Bayesian optimization and active learning via augmented Gaussian process" by Ziatdinov et al. Just in case you want to tell the journal to correct it.
