syr-cn / reactxt


Home Page: https://syr-cn.github.io/ReactXT/

License: MIT License



ReactXT: Understanding Molecular “Reaction-ship” via Reaction-Contextualized Molecule-Text Pretraining

This repo contains the PyTorch implementation of our ACL 2024 paper.

[paper], [Homepage], [Demo]

Authors: Zhiyuan Liu*, Yaorui Shi*, An Zhang, Sihang Li, Enzhi Zhang, Xiang Wang†, Kenji Kawaguchi, Tat-Seng Chua

* Equal Contribution

† Corresponding author

Framework of ReactXT

[Figure 1: Framework of ReactXT]

We propose Reaction-Contextualized Molecule-Text Pretraining (ReactXT), a new pretraining method for reaction-text modeling.

  • ReactXT incorporates chemical reactions, instead of only single molecules, into the pretraining process.
  • ReactXT performs well on both reaction-text generation and molecule-text generation downstream tasks.


ReactXT aims to improve reaction-text modeling by introducing three types of input contexts; a minimal sketch of how such a context might be assembled is given after the list below.

  • Forward reaction: The forward reaction context contains molecule roles (Reactant/Catalyst/Solvent/Product), molecule SMILES, and 2D molecular graph embeddings.
  • Backward reaction: Similar to the forward context, but with the order of molecular roles reversed. If forward context prediction trains the model to predict the product from the reactants, then backward context prediction trains it to predict the reactants from the product.
  • Random molecule: A small number of random molecules are also included to ensure the LM retains the ability to describe individual molecules outside chemical reactions.
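To make this concrete, below is a minimal, illustrative sketch in plain Python of how a reaction context might be serialized into text from molecule roles and SMILES. This is not the repo's actual data pipeline: the role names and layout are assumptions for illustration, and the 2D molecular graph embeddings used by ReactXT are omitted.

import random

# Illustration only: the real context construction lives in the repo's
# pretraining data pipeline. Roles and layout are assumed for clarity.
def build_context(reaction, direction="forward"):
    """Serialize a reaction, given as (role, SMILES) pairs, into a text context."""
    mols = list(reaction)
    if direction == "backward":
        mols = mols[::-1]  # reverse the order of molecular roles
    return "\n".join(f"{role}: {smiles}" for role, smiles in mols)

def sample_pretraining_context(reactions, molecules, p_random=0.1):
    """Mix forward/backward reaction contexts with occasional single molecules."""
    if random.random() < p_random:
        return f"Molecule: {random.choice(molecules)}"  # random-molecule context
    direction = random.choice(["forward", "backward"])
    return build_context(random.choice(reactions), direction)

if __name__ == "__main__":
    esterification = [
        ("Reactant", "CC(=O)O"),      # acetic acid
        ("Reactant", "CCO"),          # ethanol
        ("Catalyst", "OS(=O)(=O)O"),  # sulfuric acid
        ("Product", "CCOC(C)=O"),     # ethyl acetate
    ]
    print(sample_pretraining_context([esterification], ["c1ccccc1"]))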

Comparison to previous molecule-text generative modeling methods

[Figure: Comparison of ReactXT to prior molecule-text generative modeling methods]

Most prior works either focus on generating the textual description of a single molecule, or apply LMs to chemical reaction prediction without including textual descriptions of molecules/reactions in the context. In contrast, ReactXT integrates both molecule-text and reaction contexts into the pretraining process, enabling the model to generate both molecule captions and reaction components.

Requirements

The requirements of this repo are detailed in requirements.txt. To create a new environment named reactxt, run the following commands:

conda create -n reactxt python=3.8
conda activate reactxt
pip install -r requirements.txt

Reproduce the results

Our datasets and pretrained model checkpoints can be downloaded from here.

The datasets should be placed in the ./data/ directory, and the pretrained model checkpoints should be placed in the ./all_checkpoints/ directory. If done correctly, the directory layout should look like this:

data
├── action_data
├── caption_data
├── ChEBI-20_data
├── pretrain_data
│   └── reactions
└── synthesis_data
    ├── USPTO_50K_PtoR
    │   ├── test
    │   ├── train
    │   └── val
    └── USPTO_50K_PtoR_aug20
        ├── test
        ├── test_noaug
        ├── train
        └── val
all_checkpoints
├── pretrain_hybrid
│   └── last.ckpt
└── synthesis_pretrain
    └── last.ckpt
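As a quick sanity check before launching any scripts, a small sketch like the following (paths taken from the tree above; run from the repo root) can confirm that the downloads landed in the right place:

from pathlib import Path

# Sanity check that the downloaded data and checkpoints are where the scripts
# expect them. Paths mirror the directory tree above; adjust if yours differ.
expected = [
    "data/action_data",
    "data/caption_data",
    "data/ChEBI-20_data",
    "data/pretrain_data/reactions",
    "data/synthesis_data/USPTO_50K_PtoR",
    "data/synthesis_data/USPTO_50K_PtoR_aug20",
    "all_checkpoints/pretrain_hybrid/last.ckpt",
    "all_checkpoints/synthesis_pretrain/last.ckpt",
]
missing = [p for p in expected if not Path(p).exists()]
print("All expected paths are present." if not missing
      else "Missing paths:\n  " + "\n  ".join(missing))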

Below is the workflow of finetuning on the downstream tasks:

flowchart TD
    A[Reaction-Text Pretraining] --> B(Experimental Procedure\nPrediction)
    A[Reaction-Text Pretraining] --> C(Molecule Captioning)
    A[Reaction-Text Pretraining] --> D[Synthesis Pretraining]
    D --> E(Retro-synthesis Prediction)

Reaction-Contextualized Molecule-Text Pretraining

Please run the following command to perform ReactXT pretraining based on the MolCA checkpoint. The original MolCA checkpoint can be downloaded from MolCA. This script trains the model with both reaction contexts and single-molecule captioning, which corresponds to the experimental setting in the last row of our paper's Table 8.

bash scripts/run_pretrain.sh

After the pretraining step, the model will be saved in the ./all_checkpoints/ directory. You may use convert.py to combine the model checkpoints.
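convert.py in the repo is the authoritative tool for this step, and what it does internally (e.g., merging sharded checkpoints) may differ from the following. Purely as a rough illustration of the general pattern, extracting plain model weights from a standard PyTorch Lightning .ckpt could look like this, where the input path is taken from the tree above:

import torch

# Rough illustration only; use the repo's convert.py for the real conversion.
# Assumes a standard PyTorch Lightning checkpoint, which stores the model
# weights under the "state_dict" key.
ckpt = torch.load("all_checkpoints/pretrain_hybrid/last.ckpt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)
torch.save(state_dict, "all_checkpoints/pretrain_hybrid/model_weights.pt")
print(f"Saved {len(state_dict)} tensors.")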

Finetuning on downstream tasks

  1. Experimental Procedure Prediction on OpenExp

For the code to process the OpenExp dataset, refer to openExp/README.md.

Please run the following command to finetune the pretrained model on the OpenExp dataset. The script will reproduce the experimental result in Table 5.

bash scripts/run_action.sh

Since LMs with different tokenizers are compared in the paper, we use an external tokenizer when scoring the outputs for a fair comparison. Please use read_results/read_results.py to read the results.
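As an illustration of this kind of tokenizer-normalized scoring (read_results/read_results.py is the authoritative scorer; the file format and tokenizer choice below are assumptions), one could tokenize every model's predictions and references with the same external tokenizer before computing BLEU:

from nltk.translate.bleu_score import corpus_bleu
from transformers import AutoTokenizer

# Illustration only: tokenize all outputs with one shared tokenizer so that
# models with different built-in tokenizers are scored on equal footing.
# "predictions.txt" with tab-separated "prediction<TAB>reference" lines is a
# hypothetical format, not the repo's actual output layout.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

predictions, references = [], []
with open("predictions.txt") as f:
    for line in f:
        pred, ref = line.rstrip("\n").split("\t")
        predictions.append(tokenizer.tokenize(pred))
        references.append([tokenizer.tokenize(ref)])

print("BLEU-4:", corpus_bleu(references, predictions))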

  2. Molecule Captioning on PubChem324k and ChEBI-20

Please run the following commands to finetune the pretrained model on PubChem324k and ChEBI-20. The scripts will reproduce the experimental results in Table 6. The results are recorded in the output files.

bash scripts/run_caption.sh
bash scripts/run_chebi.sh

  3. Retro-synthesis Prediction on USPTO-50k

Please run the following command to finetune the pretrained model on USPTO-50k. The script will reproduce the experimental results in Table 7.

  • Following R-SMILES, we use a 20-times augmented dataset for training and testing. The augmented dataset is included in the above download link. For more details about the augmentation, please refer to R-SMILES.

  • Following R-SMILES and AT, we first train the model on USPTO-full and then finetune it on USPTO-50k. The model checkpoint trained on USPTO-full is also included in the above download link (./all_checkpoints/synthesis_pretrain/last.ckpt). The following script will finetune the model on USPTO-50k starting from this checkpoint. If you are interested in the training process on USPTO-full, please download the augmented dataset from R-SMILES.

bash scripts/run_retro.sh

The generated results will be saved in ./all_checkpoints/. Please use read_results/score.py to get the top-1~10 accuracy and valid rate.
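For reference, the sketch below shows the usual way top-k retrosynthesis accuracy and the validity rate are computed with RDKit canonicalization; read_results/score.py remains the authoritative scorer, and the candidate/ground-truth data layout here is hypothetical.

from rdkit import Chem

# Illustration only; use read_results/score.py for the reported numbers.
def canonical(smiles):
    """Canonicalize a (possibly multi-component) SMILES; return None if invalid."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Sort fragments so reactant ordering does not affect the comparison.
    return ".".join(sorted(Chem.MolToSmiles(mol).split(".")))

def score(candidates_per_sample, ground_truths, k=10):
    """Return top-1..top-k accuracies and the fraction of valid candidate SMILES."""
    hits = [0] * k
    n_valid = n_candidates = 0
    for candidates, truth in zip(candidates_per_sample, ground_truths):
        truth_can = canonical(truth)
        canon = [canonical(s) for s in candidates[:k]]
        n_candidates += len(canon)
        n_valid += sum(c is not None for c in canon)
        for rank, c in enumerate(canon):
            if c is not None and c == truth_can:
                for j in range(rank, k):
                    hits[j] += 1
                break
    n = len(ground_truths)
    return [h / n for h in hits], n_valid / max(n_candidates, 1)

# Toy usage with a single sample and two ranked candidates:
accs, valid_rate = score([["CCO.CC(=O)O", "not_a_smiles"]], ["CC(=O)O.CCO"])
print("top-1 accuracy:", accs[0], "| valid rate:", valid_rate)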

Citation

If you find this paper useful, please cite us:

@inproceedings{liu2024reactxt,
    title={ReactXT: Understanding Molecular “Reaction-ship” via Reaction-Contextualized Molecule-Text Pretraining},
    author={Liu, Zhiyuan and Shi, Yaorui and Zhang, An and Li, Sihang and Zhang, Enzhi and Wang, Xiang and Kawaguchi, Kenji and Chua, Tat-Seng},
    booktitle={Findings of the Association for Computational Linguistics: {ACL} 2024},
    publisher={Association for Computational Linguistics},
    year={2024},
    url={https://openreview.net/forum?id=V-ejDfLiwe}
}


reactxt's Issues

AttributeError: 'Namespace' object has no attribute 'generate_restrict_tokens'

When trying to run run_action.sh, the following error is raised:

Traceback (most recent call last):
  File "main.py", line 153, in <module>
    main(get_args())
  File "main.py", line 33, in main
    model = Blip2Model(args)
  File "/sshfs/chenzhe/code/patent/ReactXT/model/blip2_model.py", line 92, in __init__
    self.blip2opt = Blip2OPT(args.bert_name, args.gin_num_layers, args.gin_hidden_dim, args.drop_ratio, args.tune_gnn, not args.not_tune_qformer, args.num_query_token, args.cross_attention_freq, args.llm_tune, args.peft_dir, args.opt_model, args.prompt, args)
  File "/sshfs/chenzhe/code/patent/ReactXT/model/blip2_opt.py", line 225, in __init__
    self.generate_restrict_tokens = args.generate_restrict_tokens
AttributeError: 'Namespace' object has no attribute 'generate_restrict_tokens'
