esp's Issues

Missing Data

Dear Authors,
Could you please provide the following missing files:
/ESP/data/enzyme_data/Uniprot_df_with_ESM1b.pkl
/ESP/data/splits/df_train.pkl
/ESP/data/splits/df_test.pkl
Best,
Vahid

Inquiry Data Files

Dear Alex,

Thanks for sharing this exceptional work. I noticed that there are a lot of files in the data folder, and I was wondering if there are organized data files for the substrate SMILES and protein FASTA pairs used in the paper. Based on my understanding, the dataset should consist of substrate and enzyme pairs. If such organized files are available, it would greatly aid in my research efforts.

In the event that these files are not currently included, I would greatly appreciate any guidance or direction you could provide on how to obtain them.

Thank you very much for your time and assistance.

Difference between two prediction results (ESP_prediction provided on GitHub and the online server)

Dear Alex,
I found that there are differences between the prediction results of the two methods (the ESP_prediction code provided on GitHub and the online server), some of which are quite significant. Is this difference due to the models being different, and which one is more reliable in this situation? Regarding ESP_prediction, could you provide the latest trained model?

Looking forward to your reply!
Jing

Missing Data

Dear Authors,
Could you please provide the files that are missing at these hard-coded paths:
mol_folder = "C:\Users\alexk\mol-files\"
mol = Chem.MolFromMolFile(mol_folder + "mol-files\" + met_ID + '.mol')

Best,
Steve

Assistance Request for Creating 'all_sequences_esm1b_ts_mean.pt'

Dear Authors,
I hope that you are doing well.
I have calculated the ESM1b and ESM1b_ts vectors. As a next step, I want to create all_sequences_esm1b_ts_mean.pt; however, I am uncertain about the steps involved in this process.

Could you kindly provide guidance on how to proceed with this task?

Your assistance is greatly appreciated.

Best regards,
Vahid
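(A note for readers at the same step: the file name suggests a dictionary mapping each sequence ID to its mean-pooled 1280-dimensional ESM1b representation. Below is a minimal sketch of how such a file could be built with the fair-esm package; this is an assumption inferred from the file name rather than the authors' confirmed procedure, and for the _ts variant the task-specific fine-tuned weights would have to be loaded into the model first.)

import torch
import esm

# Assumption: all_sequences_esm1b_ts_mean.pt holds {sequence_id: mean-pooled
# 1280-dim embedding}; this is inferred from the file name, not confirmed.
model, alphabet = esm.pretrained.esm1b_t33_650M_UR50S()
model.eval()
batch_converter = alphabet.get_batch_converter()

sequences = [("P12345", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]  # hypothetical entry
_, _, tokens = batch_converter(sequences)

with torch.no_grad():
    reps = model(tokens, repr_layers=[33])["representations"][33]

embeddings = {}
for i, (seq_id, seq) in enumerate(sequences):
    # Mean over residue positions, skipping the BOS token at index 0.
    embeddings[seq_id] = reps[i, 1 : len(seq) + 1].mean(dim=0)

torch.save(embeddings, "all_sequences_esm1b_ts_mean.pt")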

Missing xgboost_training_KM.py

Hello, thank you very much for your work. However, I found that the xgboost_training_KM.py file is missing from your additional_code folder. I would appreciate it if you could provide it.

Enhancement: outliers in the dataset

Dear author, thank you for making this work open source and providing such a valuable dataset.

However, I have noticed some outliers in the data, which I think would be best to remove.

Specifically, there are protein sequences in the dataset that vary greatly in length, with the longest having 34,350 amino acids and the shortest having only 8.


_pickle.UnpicklingError: invalid load key, 'v'.

Hello Dear Authors,
I want to reproduce the results in PyCharm. After downloading and placing all files as shown in the picture below:

[Screenshot: project directory layout in PyCharm]

I am having an issue loading the df_UID_MID.pkl file in data_preprocessing.py:

df_UID_MID = pd.read_pickle(join(CURRENT_DIR, ".." ,"data", "enzyme_substrate_data", "df_UID_MID.pkl"))

Output:

Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/Users/vahidatabaigi/PycharmProjects/ESP-main/notebooks_and_code/additional_code/data_preprocessing.py", line 17, in <module>
    df_UID_MID = pd.read_pickle(join(CURRENT_DIR, ".." ,"data", "enzyme_substrate_data", "df_UID_MID.pkl"))
  File "/Users/vahidatabaigi/anaconda3/envs/ESP-Main/lib/python3.8/site-packages/pandas/io/pickle.py", line 196, in read_pickle
    with get_handle(
  File "/Users/vahidatabaigi/anaconda3/envs/ESP-Main/lib/python3.8/site-packages/pandas/io/common.py", line 710, in get_handle
    handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/vahidatabaigi/PycharmProjects/ESP-main/notebooks_and_code/additional_code/../data/enzyme_substrate_data/df_UID_MID.pkl'

However, as the picture shows, the file exists and the directory is correct. I considered the possibility that the issue might be with the join() function, so I resolved that error by pasting the absolute path to df_UID_MID.pkl directly into pd.read_pickle(), removing the join() call:

df_UID_MID = pd.read_pickle("/Users/vahidatabaigi/PycharmProjects/ESP-main/data/enzyme_substrate_data/df_UID_MID.pkl")

Output:

Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 843, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/Users/vahidatabaigi/PycharmProjects/ESP-main/notebooks_and_code/additional_code/data_preprocessing.py", line 18, in <module>
    df_UID_MID = pd.read_pickle("/Users/vahidatabaigi/PycharmProjects/ESP-main/data/enzyme_substrate_data/df_UID_MID.pkl")
  File "/Users/vahidatabaigi/anaconda3/envs/ESP-Main/lib/python3.8/site-packages/pandas/io/pickle.py", line 217, in read_pickle
    return pickle.load(handles.handle)  # type: ignore[arg-type]
_pickle.UnpicklingError: invalid load key, 'v'.

I am using a Mac M1 machine. I created a separate conda environment and installed the following packages; some required packages could not be installed at the specified versions because of conflicts, so I incrementally adjusted the versions until I found compatible ones.

(ESP-Main) vahidatabaigi@********** % conda list
# packages in environment at /Users/vahidatabaigi/anaconda3/envs/ESP-Main:
#
# Name                    Version                   Build  Channel
biopython                 1.81             py38hb991d35_0    conda-forge    
.
.
. 
jupyter                   1.0.0            py38hca03da5_8  
jupyter_client            7.4.9            py38hca03da5_0  
jupyter_console           6.6.3            py38hca03da5_0  
jupyter_core              5.3.0            py38hca03da5_0  
jupyter_events            0.6.3            py38hca03da5_0  
jupyter_server            1.23.4           py38hca03da5_0  
jupyter_server_fileid     0.9.0            py38hca03da5_0  
jupyter_server_ydoc       0.8.0            py38hca03da5_1  
jupyter_ydoc              0.2.4            py38hca03da5_0  
jupyterlab                3.6.3            py38hca03da5_0  
jupyterlab_pygments       0.1.2                      py_0  
jupyterlab_server         2.22.0           py38hca03da5_0  
jupyterlab_widgets        3.0.5            py38hca03da5_0  
numpy                     1.21.2           py38hb38b75b_0  
numpy-base                1.21.2           py38h6269429_0  
pandas                    1.3.0            py38h3777fb4_0    conda-forge
.
.
.
python                    3.8.12          hd949e87_1_cpython    conda-forge
python-dateutil           2.8.2              pyhd3eb1b0_0  
python-fastjsonschema     2.16.2           py38hca03da5_0  
python-json-logger        2.0.7            py38hca03da5_0  
python_abi                3.8                      2_cp38    conda-forge
pytorch                   2.0.1                   py3.8_0    pytorch
.
.
.
rdkit                     2021.03.3        py38hbcbf861_0    conda-forge
.
.
.


I am looking forward to your reply.
Thank you
Vahid
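(For anyone hitting the same error: UnpicklingError: invalid load key, 'v' usually means the file on disk is not a pickle at all but a Git LFS pointer, a small text file beginning with "version https://git-lfs.github.com/spec/v1" that is served when the real binary was never fetched. A quick diagnostic, sketched with the path from the traceback above:)

# Distinguish a real pickle from a Git LFS pointer file.
path = "/Users/vahidatabaigi/PycharmProjects/ESP-main/data/enzyme_substrate_data/df_UID_MID.pkl"
with open(path, "rb") as f:
    head = f.read(64)
print(head)
# A real pickle starts with b'\x80'; an LFS pointer starts with b'version http...'.
# In the latter case, fetch the real file with `git lfs pull` or download it manually.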

Missing training data

When I try to load data from "data/splits/df_train_with_ESM1b_ts.pkl", I get "UnpicklingError: invalid load key, 'v'."
I checked the size of the file, and it seems that the .pkl file is not complete.
Could you provide the complete data file?
Thanks very much.

Dear author, can you tell me what should be placed in datasets_PubChem and mol_folder

Dear AlexanderKroll,

I am a user of your code, and I have come across two hard-coded paths while using it, in ESP/notebooks_and_code/additional_code/data_preprocessing.py (lines 14 and 15):
datasets_PubChem = "D:\projects_deutschland\Prediction_of_KM_V3\datasets\substrate_synonyms"
mol_folder = "C:\Users\alexk\mol-files\"

Could you please let me know what files should be placed in these paths? I am trying to run your code and I want to ensure that I have put the files in the correct location for the code to run properly.

Thank you for your help!

Best regards
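(Until the author replies, a portable rewrite of those two hard-coded lines might look like the sketch below. The contents are inferred from how the code uses the paths, not confirmed: mol_folder appears to hold one .mol file per metabolite ID, and datasets_PubChem a folder of PubChem substrate-synonym tables.)

import os
from rdkit import Chem

# Hypothetical local replacements for the author's Windows paths.
datasets_PubChem = os.path.join("data", "substrate_synonyms")
mol_folder = os.path.join("data", "mol-files")

met_ID = "C00001"  # hypothetical metabolite ID
mol = Chem.MolFromMolFile(os.path.join(mol_folder, met_ID + ".mol"))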

Generation of 'KEGG_drugs_df.pkl' and 'KEGG_substrate_df.pkl'

Hi Alex & community,
thank you for your great work and also assisting in so many of the issues.

I am stuck on the execution of the notebook 1_0 - Creating enzyme-substrate database from GOA database.ipynb, where the two pickles (KEGG_drugs_df.pkl and KEGG_substrate_df.pkl) are supposed to be read.
How can I generate these files? When I search the entire repository for their generation, I only find the pd.read_pickle(...) calls.

Thanks for any hints
Best regards,
Felix
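(A possible starting point while waiting for an answer: similar tables can plausibly be rebuilt from KEGG's public REST API, as sketched below. The exact columns and filtering of the original pickles are not documented, so this is an assumption.)

import urllib.request
import pandas as pd

def kegg_list(database):
    # Fetch the tab-separated ID/name listing for a KEGG database ('drug', 'compound', ...).
    with urllib.request.urlopen(f"https://rest.kegg.jp/list/{database}") as resp:
        lines = resp.read().decode().strip().split("\n")
    rows = [line.split("\t") for line in lines]
    return pd.DataFrame(rows, columns=["KEGG_ID", "name"])

# Hypothetical reconstruction; the originals' columns and content may differ.
kegg_list("drug").to_pickle("KEGG_drugs_df.pkl")
kegg_list("compound").to_pickle("KEGG_substrate_df.pkl")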

Missing command for training task specific ESM1b model

It seems that the command used to train the task-specific ESM1b model is missing from the "1_0 - Creating enzyme-substrate database from GOA database.ipynb" notebook (see the screenshot below).

[Screenshot: notebook cell where the training command should appear]

Could you provide this command?

Thanks,
Max

Request for SMILES

I would like to use a method other than molecular fingerprints to represent the substrate molecules. Could you please provide the SMILES strings of the substrate molecules used in the paper? Having only the ChEBI numbers is not convenient. Thanks.
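(For others with the same need, a ChEBI-ID-to-SMILES lookup can be built locally from the ChEBI SDF release using RDKit, as sketched below; the property names follow ChEBI's SDF convention and should be verified against the downloaded file.)

from rdkit import Chem

# Build a ChEBI ID -> SMILES lookup from the ChEBI_complete SDF release
# (available from ChEBI's download area at the EBI FTP site).
chebi_to_smiles = {}
for mol in Chem.SDMolSupplier("ChEBI_complete.sdf"):
    if mol is not None and mol.HasProp("ChEBI ID"):
        chebi_to_smiles[mol.GetProp("ChEBI ID")] = Chem.MolToSmiles(mol)

print(chebi_to_smiles.get("CHEBI:15377"))  # e.g., water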

Dataset for result of Table.1

Dear AlexanderKroll,
Thanks for the solid work and the dataset construction. I'm doing related work in the area of enzyme-drug interaction. I wonder whether the two CSV files, df_UID_MID_train_exp_1_1.csv and df_UID_MID_test_exp_phylo_1_1.csv, are the original data behind the results of Table 1, and whether it is OK to use these files directly, after preprocessing, as the benchmark for comparison experiments. Looking forward to your reply!

Questions about training gradient boosting models

Dear author,
Hello,
I read the article and code you provided, but I see that the model name used in the ESP prediction code is "xgboost_model_production_mode_gnn_esm1b_ts.dat".
Your Jupyter code does not save this model; it only saves a model named "xgboost_model_production_mode.dat".
Moreover, when I trained the XGBoost model from your Jupyter file, the shape of the data was (n, 1280+50), while in the prediction program the dimension of each data point is (n, 1280+100). What is the problem?
Looking forward to your reply.

Using the dataset

Hello,

I want to train my own model on this dataset. I would like to know where I can find the dataframe containing all the data with these 3 columns: "smiles", "protein_sequence", and "score" (0 or 1, depending on whether the enzyme catalyzes the substrate).

Thank you, and let me congratulate you on your great work.
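(A quick way to check whether the split files already contain those columns, as a sketch; the actual column names in the pickles are not documented here:)

import pandas as pd

# Inspect the training split for substrate, sequence, and label columns.
df = pd.read_pickle("data/splits/df_train_with_ESM1b_ts.pkl")
print(df.columns.tolist())
print(df.head())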

Creating negative data points

Hello,
I hope this message finds you well. I have a question regarding Section 6, "Adding negative data points", subsection (a)(iii), "Creating negative data points for the test set". I noticed that you generated negative data points for both phylogenetic and experimental evidence in the training set. However, for the test set, you only generated negative data points for experimental evidence. Could you kindly explain the rationale behind this decision?

I am looking forward to hearing from you.
Thank you

Best,

Sequence length above maximum and Missing key(s) in state_dict

Hi Alex & community,
thank you for your great work

When I run the training_ESM1b_taskspecific.py script, I get the errors shown below; how should I modify the code to run it properly? The datasets, named 'train_data_ESM_training.pkl' and 'validation_data_ESM_training.pkl', were downloaded directly from the repository; I then converted them to pickle protocol 4 so they could be loaded.

[Screenshot: sequence-length-above-maximum error]

Additionally, I am trying to train an ESM2 model with the same 1280 dimensions, but I am encountering an error in FullModel, as shown below. Do you have any suggestions? Since I am loading the ESM model locally rather than downloading it online, do I still need to use the alphabet to load the 'contact-regression.pt' file?

[Screenshot: Missing key(s) in state_dict error]

Thanks for any hints
Best regards,
Domi
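(On the local-loading question: fair-esm does provide a loader for checkpoints on disk; it looks for a sibling "<checkpoint>-contact-regression.pt" file, and without it the model should still load, just with contact prediction unavailable. This is a sketch based on the fair-esm API; the regression-file behavior is an assumption to verify against your installed version.)

import esm

# Load an ESM checkpoint from a local path instead of downloading it.
model, alphabet = esm.pretrained.load_model_and_alphabet_local(
    "/path/to/esm2_t33_650M_UR50D.pt"  # hypothetical local checkpoint path
)
model.eval()
batch_converter = alphabet.get_batch_converter()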
