moltrans's People

Contributors

kexinhuang12345, limberc, printomi

moltrans's Issues

How to preprocess my own dataset

How can I use your code to preprocess a DataFrame like

d_id t_id y
C#CC(N)CCC(=O)O MGQACGHSILCRSQQ... 0

to get "drug_encoding" and "target_encoding"?
Thanks:)
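
A minimal sketch of the general shape of such an encoding step, assuming each SMILES or protein string has already been split into sub-structure tokens by some tokenizer. The `encode_tokens` helper, the toy vocabulary, and the `pad_id`/`unk_id` values are hypothetical and are not taken from the MolTrans code; it only illustrates how a token list can become a fixed-length index vector plus a padding mask.

```python
from typing import Dict, List, Tuple


def encode_tokens(
    tokens: List[str],
    vocab: Dict[str, int],
    max_len: int,
    pad_id: int = 0,  # hypothetical padding index
    unk_id: int = 1,  # hypothetical index for out-of-vocabulary tokens
) -> Tuple[List[int], List[int]]:
    """Map tokens to ids, then truncate or pad to max_len and build a 0/1 mask."""
    ids = [vocab.get(tok, unk_id) for tok in tokens][:max_len]
    n_real = len(ids)
    n_pad = max_len - n_real
    mask = [1] * n_real + [0] * n_pad  # 1 = real token, 0 = padding
    return ids + [pad_id] * n_pad, mask


# Toy vocabulary and pre-tokenized drug; both are made up for illustration.
toy_vocab = {"C#C": 2, "C(N)": 3, "CCC": 4, "(=O)O": 5}
drug_tokens = ["C#C", "C(N)", "CCC", "(=O)O"]

drug_ids, drug_mask = encode_tokens(drug_tokens, toy_vocab, max_len=6)
print(drug_ids)   # [2, 3, 4, 5, 0, 0]
print(drug_mask)  # [1, 1, 1, 1, 0, 0]
```

The same helper could be applied row by row to the d_id and t_id columns, with separate vocabularies and maximum lengths for drugs and targets.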

Running error:(

Hi kexin,
I used your code for a regression task on the KIBA dataset and got OSError: [Errno 12] Cannot allocate memory. Setting num_workers to 0 seems to avoid the error, but training becomes too slow. Is there any other way to solve this problem?
Thanks:)

About computing pairwise interaction

Hi Kexin, I have the following questions.

  1. I find the code for computing the pairwise interaction a little complicated. Since you are using a dot product, can I use torch.matmul(d_encoded_layers, p_encoded_layers.transpose(-1, -2)) directly instead of the following code?

MolTrans/models.py

Lines 86 to 100 in 47ac16b

d_encoded_layers = self.d_encoder(d_emb.float(), ex_d_mask.float())
#print(d_encoded_layers.shape)
p_encoded_layers = self.p_encoder(p_emb.float(), ex_p_mask.float())
#print(p_encoded_layers.shape)
# repeat to have the same tensor size for aggregation
d_aug = torch.unsqueeze(d_encoded_layers, 2).repeat(1, 1, self.max_p, 1) # repeat along protein size
p_aug = torch.unsqueeze(p_encoded_layers, 1).repeat(1, self.max_d, 1, 1) # repeat along drug size
i = d_aug * p_aug # interaction
i_v = i.view(int(self.batch_size/self.gpus), -1, self.max_d, self.max_p)
# batch_size x embed size x max_drug_seq_len x max_protein_seq_len
i_v = torch.sum(i_v, dim = 1)
#print(i_v.shape)
i_v = torch.unsqueeze(i_v, 1)

Besides, the view operation on line 96 of the code above confuses me. I tested it with a simple example, and it did not compute the dot product between sub-structure pairs.

import torch

max_d = 2
max_p = 3

# batch_size 1 hidden dim 2
d_encoded_layers = torch.zeros(1, max_d, 2)
d_encoded_layers[0, 0, 0] = 1 
d_encoded_layers[0, 0, 1] = 1
p_encoded_layers = torch.zeros(1, max_p, 2)
p_encoded_layers[0, 0, 0] = 1
p_encoded_layers[0, 0, 1] = 2
p_encoded_layers[0, 1, 0] = 3
p_encoded_layers[0, 1, 1] = 4 
p_encoded_layers[0, 2, 0] = 5
p_encoded_layers[0, 2, 1] = 6

print(d_encoded_layers)
print(p_encoded_layers)

d_aug = torch.unsqueeze(d_encoded_layers, 2).repeat(1, 1, max_p, 1) # repeat along protein size
p_aug = torch.unsqueeze(p_encoded_layers, 1).repeat(1, max_d, 1, 1) # repeat along drug size
i = d_aug * p_aug
print(i)
i_v = i.view(1, -1, max_d, max_p) 
print(i_v)
i_v = torch.sum(i_v, dim = 1)
print(i_v)

output:

tensor([[[1., 1.],
         [0., 0.]]])
tensor([[[1., 2.],
         [3., 4.],
         [5., 6.]]])
tensor([[[[1., 2.],
          [3., 4.],
          [5., 6.]],

         [[0., 0.],
          [0., 0.],
          [0., 0.]]]])
tensor([[[[1., 2., 3.],
          [4., 5., 6.]],

         [[0., 0., 0.],
          [0., 0., 0.]]]])
tensor([[[1., 2., 3.],
         [4., 5., 6.]]])

The final i_v looks meaningless: the representation of drug sub-structure 1 is all zeros, yet its row in i_v is [4., 5., 6.]. I think the view operation changes the arrangement of the data. Maybe the following code is more correct?

i_s = torch.sum(i, dim=-1)
print(i_s) 

output:

tensor([[[ 3.,  7., 11.],
         [ 0.,  0.,  0.]]])
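
The two reductions compared above can be checked without PyTorch. The sketch below redoes the same toy example with plain Python lists. Summing the element-wise products over the hidden dimension gives the pairwise dot products, which are the same numbers a matrix product of d with the transpose of p produces. Flattening in row-major order and regrouping as (hidden, max_d, max_p), which is what a view to that shape does on a contiguous tensor, mixes hidden and positional entries instead. Variable names are illustrative only.

```python
max_d, max_p, hidden = 2, 3, 2

d = [[1.0, 1.0], [0.0, 0.0]]              # (max_d, hidden)
p = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # (max_p, hidden)

# i[a][b][h] = d[a][h] * p[b][h], shape (max_d, max_p, hidden)
i = [[[d[a][h] * p[b][h] for h in range(hidden)]
      for b in range(max_p)] for a in range(max_d)]

# Option 1: sum over the hidden (last) dimension -> pairwise dot products.
dot = [[sum(i[a][b]) for b in range(max_p)] for a in range(max_d)]

# The same quantity written as the matrix product d @ p^T.
matmul = [[sum(d[a][h] * p[b][h] for h in range(hidden))
           for b in range(max_p)] for a in range(max_d)]

# Option 2: flatten row-major, regroup as (hidden, max_d, max_p),
# then sum over the leading axis.
flat = [x for row in i for cell in row for x in cell]
regrouped = [[[flat[(c * max_d + a) * max_p + b] for b in range(max_p)]
              for a in range(max_d)] for c in range(hidden)]
viewed_sum = [[sum(regrouped[c][a][b] for c in range(hidden))
               for b in range(max_p)] for a in range(max_d)]

print(dot)         # [[3.0, 7.0, 11.0], [0.0, 0.0, 0.0]]
print(viewed_sum)  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```

On this example, only the sum over the last dimension (equivalently, the matrix product) keeps the all-zero drug row at zero.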

  2. I think padding tokens should be filtered out of the interaction map $I$ before it is fed into the CNN. I do this by passing in d_mask and p_mask:

  d_mask = d_mask.reshape(-1, self.max_d, 1)
  p_mask = p_mask.reshape(-1, 1, self.max_p)
  # mask padding tokens
  i.masked_fill_(~d_mask, 0)
  i.masked_fill_(~p_mask, 0)
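
The effect of the two broadcast masks can be sketched without PyTorch. This assumes d_mask and p_mask are boolean with True marking real tokens (the snippet above appears to assume boolean masks, given the `~`), and that the interaction map has already been reduced to shape (max_d, max_p) for a single sample. The `mask_interaction` helper and the numbers are made up for illustration.

```python
from typing import List


def mask_interaction(
    inter: List[List[float]],
    d_mask: List[bool],
    p_mask: List[bool],
) -> List[List[float]]:
    """Zero every entry whose drug row or protein column is padding."""
    return [
        [v if (d_mask[a] and p_mask[b]) else 0.0 for b, v in enumerate(row)]
        for a, row in enumerate(inter)
    ]


inter = [[3.0, 7.0, 11.0], [2.0, 4.0, 6.0]]
d_mask = [True, False]        # second drug position is padding
p_mask = [True, True, False]  # third protein position is padding

print(mask_interaction(inter, d_mask, p_mask))
# [[3.0, 7.0, 0.0], [0.0, 0.0, 0.0]]
```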

Sorry to bother you.

How can I run the FCS algorithm?

Hi @kexinhuang12345, I have several questions about the FCS algorithm in the MolTrans paper:

  • Which dataset did you use to run FCS and obtain the frequent sub-structures?
  • How can I run the FCS algorithm? It does not seem to be in this repo.
  • For the drug SMILES sequences, did you use canonical SMILES or the SMILES format provided by the dataset itself?

This is great work and it inspires me a lot. Many thanks!

Model gives different scores for the same drug-target pairs

Hi Kexin,

I tried to write the inference code to predict the binding affinity probability given the drug-target pairs. However, I found that the model always gives different scores for the same inputs d, p, d_mask, p_mask.

        score = model(d.long().cuda(), p.long().cuda(), d_mask.long().cuda(), p_mask.long().cuda())

Then I added two breakpoints in train.py:

https://github.com/pykao/MolTrans/blob/c91a98eced0e18f9b63d439f84355f6225567287/train.py#L58

https://github.com/pykao/MolTrans/blob/c91a98eced0e18f9b63d439f84355f6225567287/train.py#L178

Then I ran

CUDA_VISIBLE_DEVICES=2,3,5,6 python train.py --task davis

This dropped me into the IPython shell.

In   [1]:   score = model(d.long().cuda(), p.long().cuda(), d_mask.long().cuda(), p_mask.long().cuda())

In   [2]:   score_1 = model(d.long().cuda(), p.long().cuda(), d_mask.long().cuda(), p_mask.long().cuda())

In   [3]:   score == score_1

Out[3]: 
tensor([[False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],                                  
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False]], device='cuda:0')

Could you tell me why the model gives different scores when the input drug-target pairs are the same?

Best regards,
Po-Yu Kao

Dataset

Good evening, greetings from Colombia. I find your work on MolTrans and the TDC library very interesting; it saves a lot of work.
I was reviewing the MolTrans datasets and comparing them with the datasets in the TDC library, specifically DAVIS and BindingDB, and I realized that the SMILES are different. To explain:
In the MolTrans SMILES strings, the "=" symbol appears where there is a bond between atoms, but it does not appear in the SMILES strings from the TDC library.

  • Could you explain the reason for this? Is it a mistake, or is it another type of representation?
  • Do you think that transforming them to SELFIES would solve this problem, so that I can compare the datasets?

[Attached images: smilestdc, smilesmoltrans]

Baseline model implementation

Hello, I read your two great papers, MolTrans and DeepPurpose.
According to the MolTrans paper, you compared MolTrans with DeepConv-DTI, GNN-CPI, etc.

Are these baseline models implemented in DeepPurpose?
For example, is the DeepConv-DTI model implemented as Drug Encoding = Morgan, Target Encoding = CNN?

If so, please share the mapping from each baseline model to its encodings.
If not, could you provide the baseline model implementations?

dimension mismatch

Dear Kexin,

Great work!

When I tried to run the training code: CUDA_VISIBLE_DEVICES=0,2,3,4,5,7 python train.py --task biosnap, it gave me the following error:

Let's use 6 GPUs!
--- Data Preparation ---
Traceback (most recent call last):
  File "train.py", line 207, in <module>
    model_max, loss_history = main()
  File "train.py", line 157, in main
    auc, auprc, f1, logits, loss = test(testing_generator, model_max)
  File "train.py", line 64, in test
    loss = loss_fct(logits, label)
  File "/home/ken/anaconda3/envs/MolTrans/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ken/anaconda3/envs/MolTrans/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 612, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/home/ken/anaconda3/envs/MolTrans/lib/python3.7/site-packages/torch/nn/functional.py", line 3058, in binary_cross_entropy
    "Please ensure they have the same size.".format(target.size(), input.size())
ValueError: Using a target size (torch.Size([16])) that is different to the input size (torch.Size([12])) is deprecated. Please ensure they have the same size.

I think there is a dimension mismatch when the loss is calculated: loss = loss_fct(logits, label).

Could you please help me solve this issue?
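
The sizes in the traceback are consistent with one possible reading, which is not confirmed in this thread. If the model reshapes each replica's interaction tensor with a leading dimension of int(batch_size / gpus), as in the models.py excerpt quoted in an earlier issue, then with 6 GPUs and a batch of 16 labels each replica emits int(16 / 6) = 2 rows, and 6 replicas together give 12 logits. The helper below only redoes that arithmetic; `gathered_rows` is a hypothetical name, and it assumes every GPU receives a share of the batch.

```python
def gathered_rows(batch_size: int, gpus: int) -> int:
    """Rows produced when each of `gpus` replicas emits int(batch_size / gpus) rows."""
    per_replica = int(batch_size / gpus)  # mirrors the view() size in the excerpt
    return per_replica * gpus


print(gathered_rows(16, 6))  # 12, versus 16 labels -> size mismatch
print(gathered_rows(16, 4))  # 16, batch size divisible by the GPU count
```

Under this reading, a GPU count that divides the batch size evenly would avoid the mismatch.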

Best,
Po-Yu

About FCS mining

Hello, thanks for your nice research and package management.
By the way, are there any differences between Byte Pair Encoding and the FCS mining algorithm described in the paper?

Running MolTrans without GPU

Hi @kexinhuang12345, I hope you are doing well. I want to run your code; could you please guide me on how to run it without a GPU? I don't have a GPU on my laptop. For information, I want to run the train.py file, not the experiment notebook.
Thanks

Where is the training data in example.ipynb

I tried to run example.ipynb, but I got

FileNotFoundError: [Errno 2] File /n/scratch3/users/k/kh278/bindingdb/fold1/train.csv does not exist: '/n/scratch3/users/k/kh278/bindingdb/fold1/train.csv'

In addition, when I run python train.py --task ${task_name} to run the experiments, I can only choose 'biosnap' as the task, and even then I get:

Traceback (most recent call last):
  File "train.py", line 206, in <module>
    model_max, loss_history = main()
  File "train.py", line 156, in main
    auc, auprc, f1, logits, loss = test(testing_generator, model_max)
  File "train.py", line 64, in test
    loss = loss_fct(logits, label)
  File "/home/xzhang/miniconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xzhang/miniconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 530, in forward
    return F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)
  File "/home/xzhang/miniconda3/envs/DeepPurpose/lib/python3.7/site-packages/torch/nn/functional.py", line 2519, in binary_cross_entropy
    "Please ensure they have the same size.".format(target.size(), input.size()))
ValueError: Using a target size (torch.Size([32])) that is different to the input size (torch.Size([16])) is deprecated. Please ensure they have the same size.

Thanks

Dataset

Hello, good morning. Greetings from Colombia! I like the work you have done on MolTrans.
I have some questions about the datasets:

  • The dataset has a binary label with which the model is trained. This label indicates whether or not there is an interaction (1: interacts, 0: does not interact). I would like to know where this label comes from, or what strategy you used to create it.

  • How did you decide the interaction threshold (DTI)? Was it deterministic, was it taken from an existing reference, or did you use some other strategy?
    Regards.

Why do I get a CUDA error?

python train.py --task bindingdb

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

NVIDIA-SMI 455.45.01 Driver Version: 455.45.01 CUDA Version: 11.1

Can this DTI model be changed into a DTA model?

Thank you for your repo. I think this model can be used as a DTA model if the problem is changed from classification to regression. If that is correct, could you give us an example? Many thanks.

ESPF Construction

Hello,

Are there any ipynb or py examples for constructing the ESPF library (the ESPF folder in this repository)?
I want to reproduce the ESPF construction using public data (for example, the number of iterations used for FCS mining).

Thanks in advance for your reply.
