3DMolMS

CC BY-NC-SA 4.0 (free for academic use)

3D Molecular Network for Mass Spectra Prediction (3DMolMS) is a deep neural network model to predict the MS/MS spectra of compounds from their 3D conformations. This model's molecular representation, learned through MS/MS prediction tasks, can be further applied to enhance performance in other molecular-related tasks, such as predicting retention times and collision cross sections.

Read our paper in Bioinformatics | Try our online service at GNPS | Install from PyPI

Installation

3DMolMS is available on PyPI. You can install the latest version using pip:

pip install molnetpack

# PyTorch must be installed separately. 
# For CUDA 11.6, install PyTorch with the following command:
pip install torch==1.13.0+cu116 torchvision==0.14.0+cu116 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu116

# For CUDA 11.7, use:
pip install torch==1.13.0+cu117 torchvision==0.14.0+cu117 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu117

# For CPU-only usage, use:
pip install torch==1.13.0+cpu torchvision==0.14.0+cpu torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cpu

3DMolMS can also be installed from source:

git clone https://github.com/JosieHong/3DMolMS.git
cd 3DMolMS

pip install .

Usage

To get started quickly, you can load a CSV or MGF file to predict MS/MS and then plot the predicted results.

import torch
from molnetpack import MolNet

# Set the device to CPU for CPU-only usage:
device = torch.device("cpu")

# For GPU usage, set the device as follows (replace '0' with your desired GPU index):
# gpu_index = 0
# device = torch.device(f"cuda:{gpu_index}")

# Instantiate a MolNet object
molnet_engine = MolNet(device, seed=42) # The random seed can be any integer. 

# Load input data (here we use a CSV file as an example)
molnet_engine.load_data(path_to_test_data='./test/input_msms.csv')
# molnet_engine.load_data(path_to_test_data='./test/input_msms.mgf') # MGF file is also supported

# Predict MS/MS
spectra = molnet_engine.pred_msms(path_to_results='./test/output_msms.mgf')

# Plot the predicted MS/MS with 3D molecular conformation
molnet_engine.plot_msms(dir_to_img='./img/')
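
The predictions above are written to an MGF file, so it can help to read that file back for downstream checks. The following is a minimal sketch of a generic MGF reader. `parse_mgf` is a hypothetical helper and not part of molnetpack, and the exact fields written to `output_msms.mgf` are not confirmed here. It assumes only the common MGF layout: `BEGIN IONS` / `END IONS` blocks, `KEY=value` parameter lines, and whitespace-separated `m/z intensity` peak lines.

```python
from typing import Dict, Iterable, List


def parse_mgf(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Parse MGF text into a list of {'params': {...}, 'peaks': [(mz, intensity), ...]}."""
    spectra: List[Dict[str, object]] = []
    current = None  # spectrum being read, or None when outside a BEGIN/END block
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        if line == 'BEGIN IONS':
            current = {'params': {}, 'peaks': []}
        elif line == 'END IONS':
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if '=' in line and not line[0].isdigit():
                # Parameter line such as TITLE=... or PEPMASS=...
                key, value = line.split('=', 1)
                current['params'][key.strip().upper()] = value.strip()
            else:
                # Peak line: m/z followed by intensity
                fields = line.split()
                if len(fields) >= 2:
                    current['peaks'].append((float(fields[0]), float(fields[1])))
    return spectra
```

To read a file, pass an open file handle, for example `with open('./test/output_msms.mgf') as f: spectra = parse_mgf(f)`.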

To predict CCS, run the following code after instantiating a MolNet object.

# Load input data
molnet_engine.load_data(path_to_test_data='./test/input_ccs.csv')

# Predict CCS
ccs_df = molnet_engine.pred_ccs(path_to_results='./test/output_ccs.csv')

To save the molecular embeddings, run the following code after instantiating a MolNet object.

# Load input data
molnet_engine.load_data(path_to_test_data='./test/input_savefeat.csv')

# Inference to get the features
ids, features = molnet_engine.save_features()

print('Titles:', ids)
print('Features shape:', features.shape)
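
One way to use the saved embeddings is to rank molecules by similarity to a query molecule. The sketch below assumes that `features` can be indexed row by row, with one numeric vector per entry in `ids`. The printed `features.shape` suggests this, but it is not confirmed here. `cosine_similarity` and `nearest_neighbours` are hypothetical helpers and not part of molnetpack.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same length')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbours(ids: Sequence[str], features: Sequence[Sequence[float]],
                       query_index: int, top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k (id, similarity) pairs most similar to the query row."""
    query = features[query_index]
    scored = [
        (ids[i], cosine_similarity(query, features[i]))
        for i in range(len(ids))
        if i != query_index
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

For example, `nearest_neighbours(ids, features, query_index=0)` lists the three entries whose embeddings point in the direction closest to that of the first molecule.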

Sample input files in CSV and MGF format are located at ./test/demo_input.csv and ./test/demo_input.mgf, respectively. If the input will be used only for CCS prediction, you may set Collision_Energy in the CSV file, or COLLISION_ENERGY in the MGF file, to any numerical value. During data loading, entries that fall outside the supported input types are excluded automatically. The supported input types are:

Item              Supported input
Atom number       <=300
Atom types        'C', 'O', 'N', 'H', 'P', 'S', 'F', 'Cl', 'B', 'Br', 'I', 'Na'
Precursor types   '[M+H]+', '[M-H]-', '[M+H-H2O]+', '[M+Na]+', '[M+2H]2+'
Collision energy  any number
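
Because unsupported entries are dropped during loading, it can be useful to check inputs against the table above beforehand. The following sketch encodes the table as plain Python. `unsupported_reasons` is a hypothetical helper and not part of molnetpack. It takes a list of element symbols, one per atom, and whether hydrogens count toward the 300-atom limit is an open assumption: the function simply counts whatever list it receives.

```python
from typing import List, Sequence

# Values copied from the supported-input table.
SUPPORTED_ATOMS = {'C', 'O', 'N', 'H', 'P', 'S', 'F', 'Cl', 'B', 'Br', 'I', 'Na'}
SUPPORTED_PRECURSORS = {'[M+H]+', '[M-H]-', '[M+H-H2O]+', '[M+Na]+', '[M+2H]2+'}
MAX_ATOMS = 300


def unsupported_reasons(atom_symbols: Sequence[str], precursor_type: str) -> List[str]:
    """Return the reasons an entry falls outside the table (empty list if supported)."""
    reasons: List[str] = []
    if len(atom_symbols) > MAX_ATOMS:
        reasons.append(f'too many atoms: {len(atom_symbols)} > {MAX_ATOMS}')
    unknown = sorted(set(atom_symbols) - SUPPORTED_ATOMS)
    if unknown:
        reasons.append('unsupported atom types: ' + ', '.join(unknown))
    if precursor_type not in SUPPORTED_PRECURSORS:
        reasons.append(f'unsupported precursor type: {precursor_type}')
    return reasons
```

An empty list means the entry passes all three checks. Collision energy is not checked because the table accepts any number.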

The sample output files are at ./test/demo_msms.mgf and ./test/demo_ccs.csv. Below is an example of a predicted MS/MS spectrum plot.

Documentation for running MS/MS prediction from source is in MSMS_PRED.md.

Train your own model

Step 0: Clone the Repository and Set Up the Environment

Clone the 3DMolMS repository and install the required packages using the following commands:

git clone https://github.com/JosieHong/3DMolMS.git
cd 3DMolMS

# Install the package if you have not already done so.
pip install .

Step 1: Obtain the Pretrained Model

Download the pretrained model (molnet_pre_etkdgv3.pt.zip) from Google Drive or train the model yourself. For details on pretraining the model on the QM9 dataset, refer to PRETRAIN.md.

Step 2: Prepare the Datasets

Download and organize the datasets into the ./data/ directory. The current version uses four datasets:

  1. Agilent DPCL, provided by Agilent Technologies.
  2. NIST20, available under license for academic use.
  3. MoNA, publicly available.
  4. Waters QTOF, our own experimental dataset.

The data directory structure should look like this:

|- data
  |- origin
    |- Agilent_Combined.sdf
    |- Agilent_Metlin.sdf
    |- hr_msms_nist.SDF
    |- MoNA-export-All_LC-MS-MS_QTOF.sdf
    |- MoNA-export-All_LC-MS-MS_Orbitrap.sdf
    |- waters_qtof.mgf
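
Before preprocessing, a quick check that the files listed above are in place can save a failed run. This is a small sketch using only the standard library. `missing_dataset_files` is a hypothetical helper and not part of the repository. The file names are copied from the directory listing above.

```python
from pathlib import Path
from typing import List

# File names copied from the directory listing above.
EXPECTED_FILES = [
    'Agilent_Combined.sdf',
    'Agilent_Metlin.sdf',
    'hr_msms_nist.SDF',
    'MoNA-export-All_LC-MS-MS_QTOF.sdf',
    'MoNA-export-All_LC-MS-MS_Orbitrap.sdf',
    'waters_qtof.mgf',
]


def missing_dataset_files(data_root: str = './data') -> List[str]:
    """Return the expected files that are absent from <data_root>/origin."""
    origin = Path(data_root) / 'origin'
    return [name for name in EXPECTED_FILES if not (origin / name).is_file()]


if __name__ == '__main__':
    for name in missing_dataset_files():
        print('missing:', name)
```

If you preprocess only a subset of the datasets, files belonging to the others can be ignored.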

Step 3: Preprocess the Datasets

Run the following command to preprocess the datasets. Specify the datasets with --dataset and set --instrument_type to qtof. Add --maxmin_pick to select training molecules with the MaxMin algorithm; without it, molecules are selected at random. The dataset configurations are in ./src/molnetpack/config/preprocess_etkdgv3.yml.

python ./src/scripts/preprocess.py --dataset agilent nist mona waters \
--instrument_type qtof \
--data_config_path ./src/molnetpack/config/preprocess_etkdgv3.yml \
--mgf_dir ./data/mgf_debug/
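
For readers unfamiliar with the MaxMin algorithm mentioned in this step, the sketch below shows the general greedy idea: starting from one item, repeatedly pick the item whose distance to its nearest already-picked item is largest, which spreads the selection across the data. This is a generic illustration and not the repository's implementation. `maxmin_pick` and the `distance` callable are hypothetical, and the molecular distance used by --maxmin_pick is not shown here.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar('T')


def maxmin_pick(items: Sequence[T], n_pick: int,
                distance: Callable[[T, T], float], first: int = 0) -> List[int]:
    """Greedy MaxMin selection: return indices of n_pick mutually distant items."""
    if not items or n_pick <= 0:
        return []
    n_pick = min(n_pick, len(items))
    picked = [first]
    is_picked = [False] * len(items)
    is_picked[first] = True
    # min_dist[i] is the distance from item i to its nearest picked item.
    min_dist = [distance(item, items[first]) for item in items]
    while len(picked) < n_pick:
        # Choose the unpicked item that is farthest from the picked set.
        candidates = [i for i in range(len(items)) if not is_picked[i]]
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        is_picked[nxt] = True
        # Update nearest-picked distances with the newly picked item.
        for i, item in enumerate(items):
            d = distance(item, items[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return picked


if __name__ == '__main__':
    values = [0.0, 0.1, 0.2, 5.0, 10.0]
    print(maxmin_pick(values, 3, lambda a, b: abs(a - b)))  # prints [0, 4, 3]
```

In the toy example the three picks are the two extremes and the midpoint, while the clustered values 0.1 and 0.2 are skipped.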

Step 4: Train the Model

Use the following commands to train the model. Configuration settings for the model and training process are located in ./src/molnetpack/config/molnet.yml.

python ./src/scripts/train.py --train_data ./data/qtof_etkdgv3_train.pkl \
--test_data ./data/qtof_etkdgv3_test.pkl \
--model_config_path ./src/molnetpack/config/molnet.yml \
--data_config_path ./src/molnetpack/config/preprocess_etkdgv3.yml \
--checkpoint_path ./check_point/molnet_qtof_etkdgv3.pt \
--transfer --resume_path ./check_point/molnet_pre_etkdgv3.pt

Additional applications

3DMolMS is also capable of predicting molecular properties and generating reference libraries for molecular identification. Examples of such applications include retention time prediction and collision cross-section prediction. For more details, refer to PROP_PRED.md and GEN_REFER_LIB.md respectively.

Citation

If you use 3DMolMS in your research, please cite our paper:

@article{hong20233dmolms,
  title={3DMolMS: prediction of tandem mass spectra from 3D molecular conformations},
  author={Hong, Yuhui and Li, Sujun and Welch, Christopher J and Tichy, Shane and Ye, Yuzhen and Tang, Haixu},
  journal={Bioinformatics},
  volume={39},
  number={6},
  pages={btad354},
  year={2023},
  publisher={Oxford University Press}
}

Thank you for considering 3DMolMS for your research needs!

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


3dmolms's Issues

No module named 'molnetpack'

from molnetpack.molnetpack.data_utils import conformation_array
This import is no longer compatible with the current version and fails when running qm92pkl.py.

Fix?
from molmspack.data_utils.utils import conformation_array

Error when following instructions for preprocessing the QM9 datasets.

I followed the instructions for pretraining with the QM9 dataset, downloaded from
https://figshare.com/collections/Quantum_chemistry_structures_and_properties_of_134_kilo_molecules/978904

Issue:
qm92pkl.py doesn't run as intended.

Calling xyz2dict(xyz_dir) raises the following error:

AttributeError: module 'numpy' has no attribute 'float'

The specific line:
pkl_list.append({'title': file_name, 'smiles': smiles, 'y': np.array(scalar_prop[3:16], dtype=np.float)})

Fix?
pkl_list.append({'title': file_name, 'smiles': smiles, 'y': np.array(scalar_prop[3:16], dtype=np.float64)})

all2pkl.py: Pandas import, undefined functions

Hello, I'm trying to use the main branch code to perform some inferences over a few molecules and I'm running into issues when providing a csv input. It seems that all2pkl.py is missing an import pandas as pd along with a definition or import for ce2nce.

Thanks!
