kexinhuang12345 / moltrans

MolTrans: Molecular Interaction Transformer for Drug Target Interaction Prediction (Bioinformatics)

Home Page: https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btaa880/5929692

License: BSD 3-Clause "New" or "Revised" License

Python 12.87% Jupyter Notebook 87.13%

moltrans's Issues

How to convert regression labels of Davis dataset to binary classification labels?

I guess you use a threshold to split the dataset based on the affinity score. How do you choose this threshold? Is there any reference paper to do so?
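For readers with the same question, thresholding a continuous affinity into a binary label can be sketched as follows. The function name, the sample Kd values, and the cutoff are illustrative assumptions and are not taken from this repository; the cutoff actually used should be confirmed from the paper or by the author.

```python
def binarize_affinity(kd_values_nm, cutoff_nm):
    """Label a pair 1 (interacting) when Kd is below the cutoff, else 0."""
    return [1 if kd < cutoff_nm else 0 for kd in kd_values_nm]


# Hypothetical Kd values (nM) and a hypothetical cutoff; neither comes from the repo.
kd_values = [5.0, 29.9, 30.0, 10000.0]
labels = binarize_affinity(kd_values, cutoff_nm=30.0)
print(labels)
```

A lower Kd means tighter binding, so the comparison is "below the cutoff"; for scores where higher means stronger binding, the comparison would flip.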

How to preprocess my own dataset

How can I use your code to preprocess a dataframe like

d_id	t_id	y
C#CC(N)CCC(=O)O	MGQACGHSILCRSQQ...	0

to get "drug_encoding" and "target_encoding"?
Thanks:)
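As a rough illustration of what such an encoding step generally involves, the sketch below turns a sequence into a fixed-length list of token ids plus a padding mask. It uses greedy longest-match tokenization over a toy vocabulary, which is a stand-in technique: the function name, the vocabulary, and `max_len` are all hypothetical, and the repository's own sub-structure vocabulary and encoder should be used for real data.

```python
def encode_sequence(seq, vocab, max_len, pad_id=0):
    """Greedy longest-match tokenization, then pad or truncate to max_len.

    Returns (ids, mask), where mask is 1 for real tokens and 0 for padding.
    """
    longest = max(len(tok) for tok in vocab)
    ids, pos = [], 0
    while pos < len(seq):
        for width in range(min(longest, len(seq) - pos), 0, -1):
            piece = seq[pos:pos + width]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += width
                break
        else:
            # Unknown character: skipped in this sketch.
            pos += 1
    ids = ids[:max_len]
    mask = [1] * len(ids) + [0] * (max_len - len(ids))
    ids = ids + [pad_id] * (max_len - len(ids))
    return ids, mask


# Toy vocabulary, invented for illustration only.
toy_drug_vocab = {"C": 1, "N": 2, "O": 3, "=": 4, "#": 5,
                  "(": 6, ")": 7, "CC": 8, "C(=O)O": 9}
ids, mask = encode_sequence("C#CC(N)CCC(=O)O", toy_drug_vocab, max_len=12)
print(ids)
print(mask)
```

The same function could be applied to the protein column with a protein vocabulary and a larger `max_len`.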

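Hello, the results I reproduce with your code differ substantially from those reported in your paper. Could you provide detailed training logs and the dataset split?

Hello, as in the title. Thank you.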

Hi Kexin,
I used your code for a regression task on the KIBA dataset and got OSError: [Errno 12] Cannot allocate memory. Setting num_workers to 0 seems to avoid the error, but it is too slow. Is there any other way to solve this problem?
Thanks:)

About computing pairwise interaction

Hi Kexin, I have the following questions.

1. I find the code for computing the pairwise interaction a little complicated. Since you are using a dot product, can I use torch.matmul(d_encoded_layers, p_encoded_layers.transpose(-1, -2)) directly instead of the following code?

MolTrans/models.py

Lines 86 to 100 in 47ac16b

d_encoded_layers = self.d_encoder(d_emb.float(), ex_d_mask.float())
#print(d_encoded_layers.shape)
p_encoded_layers = self.p_encoder(p_emb.float(), ex_p_mask.float())
#print(p_encoded_layers.shape)

# repeat to have the same tensor size for aggregation
d_aug = torch.unsqueeze(d_encoded_layers, 2).repeat(1, 1, self.max_p, 1) # repeat along protein size
p_aug = torch.unsqueeze(p_encoded_layers, 1).repeat(1, self.max_d, 1, 1) # repeat along drug size

i = d_aug * p_aug # interaction
i_v = i.view(int(self.batch_size/self.gpus), -1, self.max_d, self.max_p)
# batch_size x embed size x max_drug_seq_len x max_protein_seq_len
i_v = torch.sum(i_v, dim = 1)
#print(i_v.shape)
i_v = torch.unsqueeze(i_v, 1)

Besides, the view operation on line 96 of the code above confuses me. I tested it with a simple example, and it did not compute the dot product between sub-structure pairs.

max_d = 2
max_p = 3

# batch_size 1 hidden dim 2
d_encoded_layers = torch.zeros(1, max_d, 2)
d_encoded_layers[0, 0, 0] = 1 
d_encoded_layers[0, 0, 1] = 1
p_encoded_layers = torch.zeros(1, max_p, 2)
p_encoded_layers[0, 0, 0] = 1
p_encoded_layers[0, 0, 1] = 2
p_encoded_layers[0, 1, 0] = 3
p_encoded_layers[0, 1, 1] = 4 
p_encoded_layers[0, 2, 0] = 5
p_encoded_layers[0, 2, 1] = 6

print(d_encoded_layers)
print(p_encoded_layers)

d_aug = torch.unsqueeze(d_encoded_layers, 2).repeat(1, 1, max_p, 1) # repeat along protein size
p_aug = torch.unsqueeze(p_encoded_layers, 1).repeat(1, max_d, 1, 1) # repeat along drug size
i = d_aug * p_aug
print(i)
i_v = i.view(1, -1, max_d, max_p) 
print(i_v)
i_v = torch.sum(i_v, dim = 1)
print(i_v)

output:

tensor([[[1., 1.],
         [0., 0.]]])
tensor([[[1., 2.],
         [3., 4.],
         [5., 6.]]])
tensor([[[[1., 2.],
          [3., 4.],
          [5., 6.]],

         [[0., 0.],
          [0., 0.],
          [0., 0.]]]])
tensor([[[[1., 2., 3.],
          [4., 5., 6.]],

         [[0., 0., 0.],
          [0., 0., 0.]]]])
tensor([[[1., 2., 3.],
         [4., 5., 6.]]])

The final i_v looks wrong: the representation of drug sub-structure 1 is all zeros, yet its row in i_v is not. I think the view operation changes the arrangement of the data. Maybe the following code is more correct?

i_s = torch.sum(i, dim=-1)
print(i_s)

output:

tensor([[[ 3.,  7., 11.],
         [ 0.,  0.,  0.]]])
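The comparison in this issue can be checked on random inputs. The sketch below assumes only standard PyTorch tensor operations; the shapes and variable names are made up for the test and are not the model's real dimensions. It compares summing the element-wise product over the hidden dimension, a direct torch.matmul, and the reshape-then-sum pattern.

```python
import torch

torch.manual_seed(0)
batch, max_d, max_p, hidden = 2, 3, 4, 5   # arbitrary test sizes
d = torch.randn(batch, max_d, hidden)
p = torch.randn(batch, max_p, hidden)

# Element-wise products for every (drug, protein) pair:
# shape [batch, max_d, max_p, hidden]
i = d.unsqueeze(2) * p.unsqueeze(1)

dot_by_sum = i.sum(dim=-1)                              # sum over hidden dim
dot_by_matmul = torch.matmul(d, p.transpose(-1, -2))    # batched dot products
view_then_sum = i.reshape(batch, -1, max_d, max_p).sum(dim=1)

same = torch.allclose(dot_by_sum, dot_by_matmul, atol=1e-6)
differs = not torch.allclose(view_then_sum, dot_by_matmul, atol=1e-6)
print(same, differs)
```

On random inputs the first check should pass and the second should report a difference, since reshaping reinterprets the memory layout rather than moving the hidden dimension to position 1. A permute(0, 3, 1, 2) before the sum would move that dimension without reordering values.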

2. I think padding tokens should be filtered out of the interaction map $I$ before it is fed into the CNN. I do this by passing d_mask and p_mask:

  d_mask = d_mask.reshape(-1, self.max_d, 1)
  p_mask = p_mask.reshape(-1, 1, self.max_p)
  # mask padding tokens
  i.masked_fill_(~d_mask, 0)
  i.masked_fill_(~p_mask, 0)

Sorry to bother you.
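A self-contained version of the masking idea is sketched below. It assumes boolean masks in which True marks a real token and False marks padding, and an interaction map of shape [batch, max_d, max_p]; the tensors here are stand-ins, not the model's actual inputs.

```python
import torch

max_d, max_p = 3, 4
# Hypothetical masks: True for real tokens, False for padding.
d_mask = torch.tensor([[True, True, False]])
p_mask = torch.tensor([[True, True, True, False]])

# Stand-in for the [batch, max_d, max_p] interaction map.
interaction = torch.ones(1, max_d, max_p)

# Broadcast both masks into one pairwise validity mask.
pair_mask = d_mask.reshape(-1, max_d, 1) & p_mask.reshape(-1, 1, max_p)

# Zero out every cell where either the drug or the protein token is padding.
masked = interaction.masked_fill(~pair_mask, 0.0)
print(masked)
```

If the masks are stored as 0/1 integers, they would need a .bool() conversion first, because ~ on an integer tensor is a bitwise NOT rather than a logical one.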

How can I run the FCS algorithm?

Hi @kexinhuang12345, I have several questions about the FCS algorithm in the MolTrans paper:

1. Which dataset did you use to run FCS and obtain the frequent sub-structures?
2. How can I run the FCS algorithm? It does not seem to be in this repo.
3. For the drug SMILES sequences, did you use canonical SMILES or the SMILES format provided by the dataset itself?

This is great work and it inspired me a lot. Many thanks!

Model gives different scores for the same drug-target pairs

Hi Kexin,

I tried to write the inference code to predict the binding affinity probability given the drug-target pairs. However, I found that the model always gives different scores for the same inputs d, p, d_mask, p_mask.

        score = model(d.long().cuda(), p.long().cuda(), d_mask.long().cuda(), p_mask.long().cuda())

Then I added two breakpoints in train.py:

https://github.com/pykao/MolTrans/blob/c91a98eced0e18f9b63d439f84355f6225567287/train.py#L58

https://github.com/pykao/MolTrans/blob/c91a98eced0e18f9b63d439f84355f6225567287/train.py#L178

Then I ran

CUDA_VISIBLE_DEVICES=2,3,5,6 python train.py --task davis

This dropped me into the IPython interface.

In   [1]:   score = model(d.long().cuda(), p.long().cuda(), d_mask.long().cuda(), p_mask.long().cuda())

In   [2]:   score_1 = model(d.long().cuda(), p.long().cuda(), d_mask.long().cuda(), p_mask.long().cuda())

In   [3]:   score == score_1

Out[3]: 
tensor([[False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False]], device='cuda:0')

Could you tell me why the model gives different scores when the input drug-target pairs are the same?

Best regards,
Po-Yu Kao
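One common cause of this symptom in PyTorch is a dropout layer left active because the model is in training mode. Whether that is what happens here is an assumption, not something confirmed in this thread. The sketch below uses a small stand-in network, not the MolTrans architecture, to show the effect of model.eval().

```python
import torch
from torch import nn

torch.manual_seed(0)
# Stand-in network with dropout; this is not the MolTrans model.
net = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5), nn.Linear(8, 1))
x = torch.randn(60, 8)

# Training mode: dropout draws a fresh random mask on every call.
net.train()
train_equal = torch.equal(net(x), net(x))

# Evaluation mode: dropout is a no-op, so repeated calls agree.
net.eval()
with torch.no_grad():
    eval_equal = torch.equal(net(x), net(x))

print(train_equal, eval_equal)
```

If the scores still differ after calling model.eval(), the cause lies elsewhere, for example in non-deterministic GPU kernels, where torch.allclose is a more suitable comparison than ==.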

Dataset

Good evening, and greetings from Colombia. I find your work on MolTrans and the TDC library very interesting; it saves a lot of work.
I was reviewing the MolTrans datasets and compared them with the datasets in the TDC library, specifically DAVIS and BindingDB. I noticed that the SMILES strings differ:
in the MolTrans SMILES strings, the "=" symbol appears where there is a bond between atoms, but it does not appear in the SMILES strings of the TDC library.

Could you explain the reason for this? Is it a mistake, or is it another type of representation?
If I convert them to SELFIES, do you think that would solve the problem and let me compare the datasets?

Baseline model implementation

Hello, I read your two great papers, MolTrans and DeepPurpose.
According to the MolTrans paper, you compared MolTrans with DeepConv-DTI, GNN-CPI, etc.

Are these baseline models implemented in DeepPurpose?
For example, is the DeepConv-DTI model implemented as Drug Encoding = Morgan, Target Encoding = CNN?

If so, please share the mapping table from baseline models to encodings.
If not, could you provide the baseline model implementations?

dimension mismatch

Dear Kexin,

Great work!

When I tried to run the training code: CUDA_VISIBLE_DEVICES=0,2,3,4,5,7 python train.py --task biosnap, it gave me the following error:

Let's use 6 GPUs!
--- Data Preparation ---
Traceback (most recent call last):
  File "train.py", line 207, in <module>
    model_max, loss_history = main()
  File "train.py", line 157, in main
    auc, auprc, f1, logits, loss = test(testing_generator, model_max)
  File "train.py", line 64, in test
    loss = loss_fct(logits, label)
  File "/home/ken/anaconda3/envs/MolTrans/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ken/anaconda3/envs/MolTrans/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 612, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/home/ken/anaconda3/envs/MolTrans/lib/python3.7/site-packages/torch/nn/functional.py", line 3058, in binary_cross_entropy
    "Please ensure they have the same size.".format(target.size(), input.size())
ValueError: Using a target size (torch.Size([16])) that is different to the input size (torch.Size([12])) is deprecated. Please ensure they have the same size.

I think there is a dimension mismatch when the loss is calculated: loss = loss_fct(logits, label)

Could you please help solve this issue?

Best,
Po-Yu
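One reading that is consistent with the numbers in this traceback is sketched below. It rests on three assumptions that are not confirmed in this thread: the configured batch size is 16, the multi-GPU wrapper splits a batch into chunks of ceil(batch / gpus) samples, and each replica reshapes its output to int(batch_size / gpus) rows, as in the models.py lines quoted elsewhere on this page. The helper name is hypothetical.

```python
import math


def chunk_sizes(batch, n_chunks):
    """Sizes produced when a batch is split into chunks of ceil(batch / n_chunks)."""
    size = math.ceil(batch / n_chunks)
    sizes = []
    remaining = batch
    while remaining > 0:
        sizes.append(min(size, remaining))
        remaining -= size
    return sizes


batch_size, gpus = 16, 6   # values read off the log and traceback above
per_replica = chunk_sizes(batch_size, gpus)

# Assumed: every replica reshapes to int(batch_size / gpus) rows,
# regardless of how many samples it actually received.
rows_per_replica = int(batch_size / gpus)
total_rows = rows_per_replica * len(per_replica)

print(per_replica, total_rows)
```

Under these assumptions the gathered output has 12 rows against 16 labels, which matches the sizes in the error. Choosing a GPU count that divides the batch size evenly would avoid the mismatch in this reading.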

About FCS mining

Hello, thanks for your nice research and package management.
By the way, are there any differences between Byte Pair Encoding and the FCS mining algorithm described in the paper?
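For comparison purposes, a minimal sketch of generic byte pair encoding on character sequences follows. It is not the paper's FCS implementation, and the function names, the `min_count` stopping rule, and the toy SMILES-like strings are illustrative assumptions; any differences from FCS would need to be confirmed by the author.

```python
from collections import Counter


def most_frequent_pair(corpus):
    """Return ((left, right), count) for the most frequent adjacent pair, or None."""
    counts = Counter()
    for tokens in corpus:
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(1)[0] if counts else None


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of pair with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_merges(sequences, num_merges, min_count=2):
    """Repeatedly merge the most frequent adjacent pair until a limit is hit."""
    corpus = [list(s) for s in sequences]
    merges = []
    for _ in range(num_merges):
        best = most_frequent_pair(corpus)
        if best is None or best[1] < min_count:
            break
        merges.append(best[0])
        corpus = [merge_pair(tokens, best[0]) for tokens in corpus]
    return merges, corpus


merges, corpus = learn_merges(["CCO", "CCN", "CCC"], num_merges=1)
print(merges, corpus)
```

The two knobs here are the number of merge iterations and the minimum pair frequency.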

Running MolTrans without GPU

Hi @kexinhuang12345, I hope you are doing well. I want to run your code; could you please explain how to run it without a GPU? I don't have a GPU on my laptop. For reference, I want to run the train.py file, not the experiment notebook.
Thanks
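The usual PyTorch pattern for code that should run with or without a GPU is to pick a device once and move the model and tensors to it. The sketch below uses a stand-in linear layer rather than the MolTrans model; adapting train.py this way would presumably mean replacing its hard-coded .cuda() calls with .to(device), which is an assumption about that script rather than a tested change.

```python
import torch
from torch import nn

# Fall back to the CPU when no CUDA device is available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 1).to(device)    # stand-in for the real model
batch = torch.randn(2, 4).to(device)  # stand-in for a batch of inputs

with torch.no_grad():
    out = model(batch)

print(device, out.shape)
```

Training on a CPU will be much slower, so a smaller batch size or a subset of the data may be needed.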

Where is the training data in example.ipynb

I tried to run example.ipynb, but I got

FileNotFoundError: [Errno 2] File /n/scratch3/users/k/kh278/bindingdb/fold1/train.csv does not exist: '/n/scratch3/users/k/kh278/bindingdb/fold1/train.csv'

In addition, when I run python train.py --task ${task_name} to run the experiments, the only task I can choose is 'biosnap'. However, I got

Traceback (most recent call last):
  File "train.py", line 206, in <module>
    model_max, loss_history = main()
  File "train.py", line 156, in main
    auc, auprc, f1, logits, loss = test(testing_generator, model_max)
  File "train.py", line 64, in test
    loss = loss_fct(logits, label)
  File "/home/xzhang/miniconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xzhang/miniconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 530, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/home/xzhang/miniconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/functional.py", line 2519, in binary_cross_entropy
    "Please ensure they have the same size.".format(target.size(), input.size()))
ValueError: Using a target size (torch.Size([32])) that is different to the input size (torch.Size([16])) is deprecated. Please ensure they have the same size.

Thanks

Dataset

Hello, good morning. Greetings from Colombia! I like the work you have done on MolTrans.
I have some questions about the datasets:

The dataset has a binary label on which the model is trained. This label indicates whether there is an interaction (1: interacts, 0: does not interact). I would like to know where this label came from, or what strategy you used to create it.
How did you decide the interaction threshold (DTI)? Was it deterministic, taken from an existing reference, or chosen by some other strategy?
Regards.

Why do I get a CUDA error?

python train.py --task bindingdb

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

NVIDIA-SMI 455.45.01 Driver Version: 455.45.01 CUDA Version: 11.1

Can this DTI model be changed into a DTA model?

Thank you for your repo. I think this model can be used as a DTA model if the problem is changed from classification to regression. If that is correct, could you give us an example? Many thanks

wrong calculation of precision

Hi.
Thank you for sharing your paper and code.
While looking through your code, I noticed that the line that calculates precision in the evaluation step seems to be wrong,
so I opened this issue.

MolTrans/train.py

Line 79 in 47ac16b

precision = tpr / (tpr + fpr)

https://en.wikipedia.org/wiki/Precision_and_recall
TPR = TP/(TP+FN), FPR = FP/(FP+TN)
Precision = TP/(TP+FP)
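The difference between the two expressions shows up clearly on an imbalanced confusion matrix. The counts below are invented for illustration and do not come from any MolTrans experiment.

```python
def rates(tp, fp, tn, fn):
    """Return (TPR, FPR, precision) from confusion-matrix counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    return tpr, fpr, precision


# Hypothetical counts: 10 positives, 990 negatives.
tp, fp, tn, fn = 8, 30, 960, 2
tpr, fpr, precision = rates(tp, fp, tn, fn)

# The expression from the issue, built from rates rather than counts.
from_rates = tpr / (tpr + fpr)

print(round(precision, 4), round(from_rates, 4))
```

With these counts the true precision is about 0.21, while tpr / (tpr + fpr) is about 0.96. The two expressions agree only when the number of positives equals the number of negatives.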

Environment Setup

Do you have an environment setup file for MolTrans?

ESPF Construction

Hello,

Are there any .ipynb or .py examples for constructing the ESPF library (the ESPF folder in this repository)?
I want to reproduce the ESPF construction using public data, including details such as the number of iterations for FCS mining.

Thanks in advance for your reply.
