map4's People

Contributors

alicecapecchi, iwatobipen, richardjgowers

map4's Issues

AttributeError: module 'tmap' has no attribute 'Minhash'

Traceback (most recent call last):
  File "/home/berenger/src/map4/map4/map4.py", line 182, in <module>
    main()
  File "/home/berenger/src/map4/map4/map4.py", line 137, in main
    calculator = MAP4Calculator(args.dimensions, args.radius, args.is_counted, args.is_folded)
  File "/home/berenger/src/map4/map4/map4.py", line 33, in __init__
    self.encoder = tm.Minhash(dimensions)
AttributeError: module 'tmap' has no attribute 'Minhash'
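
A small diagnostic can help narrow this error down. The sketch below only reports whether a module can be found, where it was loaded from, and whether it exposes a given attribute. One possible cause, raised in another issue on this page, is that a different package that is also named tmap is installed; that is an assumption to check, not a confirmed diagnosis. The helper name describe_module_attribute is made up for this example.

```python
import importlib
import importlib.util


def describe_module_attribute(module_name, attribute):
    """Report whether `module_name` is importable and exposes `attribute`."""
    # find_spec returns None when no top-level module of that name is installed.
    if importlib.util.find_spec(module_name) is None:
        return "{} is not installed".format(module_name)

    module = importlib.import_module(module_name)
    # Built-in modules have no __file__, hence the default value.
    location = getattr(module, "__file__", "unknown location")

    if hasattr(module, attribute):
        return "{}.{} found ({})".format(module_name, attribute, location)
    return "{} at {} has no attribute {}".format(module_name, location, attribute)


if __name__ == "__main__":
    # The printed location shows which installed package is actually imported.
    print(describe_module_attribute("tmap", "Minhash"))
```

If the reported location is not the package you expected, the environment most likely contains a different distribution under the same import name.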

dimensions does not work when is_folded is True

MAP4Calculator correction:

#!/usr/bin/env python

import argparse
import itertools
from collections import defaultdict

import tmap as tm
from mhfp.encoder import MHFPEncoder
from rdkit import Chem
from rdkit.Chem import rdmolops
from rdkit.Chem.rdmolops import GetDistanceMatrix


def to_smiles(mol):
    return Chem.MolToSmiles(mol, canonical=True, isomericSmiles=False)


class MAP4Calculator:

    def __init__(self, dimensions=1024, radius=2, is_counted=False, is_folded=False):
        """
        MAP4 calculator class
        """
        self.dimensions = dimensions
        self.radius = radius
        self.is_counted = is_counted
        self.is_folded = is_folded

        if self.is_folded:
            self.encoder = MHFPEncoder(dimensions)
        else:
            self.encoder = tm.Minhash(dimensions)

    def calculate(self, mol):
        """Calculates the atom pair minhashed fingerprint
        Arguments:
            mol -- rdkit mol object
        Returns:
            tmap VectorUint -- minhashed fingerprint
        """
        
        atom_env_pairs = self._calculate(mol)
        if self.is_folded:
            return self._fold(atom_env_pairs)
        return self.encoder.from_string_array(atom_env_pairs)

    def calculate_many(self, mols):
        """ Calculates the atom pair minhashed fingerprint
        Arguments:
            mols -- list of mols
        Returns:
            list of tmap VectorUint -- minhashed fingerprints list
        """

        atom_env_pairs_list = [self._calculate(mol) for mol in mols]
        if self.is_folded:
            return [self._fold(pairs) for pairs in atom_env_pairs_list]
        return self.encoder.batch_from_string_array(atom_env_pairs_list)

    def _calculate(self, mol):
        return self._all_pairs(mol, self._get_atom_envs(mol))

    def _fold(self, pairs):
        fp_hash = self.encoder.hash(set(pairs))
        return self.encoder.fold(fp_hash, self.dimensions)

    def _get_atom_envs(self, mol):
        atoms_env = {}
        for atom in mol.GetAtoms():
            idx = atom.GetIdx()
            for radius in range(1, self.radius + 1):
                if idx not in atoms_env:
                    atoms_env[idx] = []
                atoms_env[idx].append(MAP4Calculator._find_env(mol, idx, radius))
        return atoms_env

    @classmethod
    def _find_env(cls, mol, idx, radius):
        env = rdmolops.FindAtomEnvironmentOfRadiusN(mol, radius, idx)
        atom_map = {}

        submol = Chem.PathToSubmol(mol, env, atomMap=atom_map)
        if idx in atom_map:
            smiles = Chem.MolToSmiles(submol, rootedAtAtom=atom_map[idx], canonical=True, isomericSmiles=False)
            return smiles
        return ''

    def _all_pairs(self, mol, atoms_env):
        atom_pairs = []
        distance_matrix = GetDistanceMatrix(mol)
        num_atoms = mol.GetNumAtoms()
        shingle_dict = defaultdict(int)
        for idx1, idx2 in itertools.combinations(range(num_atoms), 2):
            dist = str(int(distance_matrix[idx1][idx2]))

            for i in range(self.radius):
                env_a = atoms_env[idx1][i]
                env_b = atoms_env[idx2][i]

                ordered = sorted([env_a, env_b])

                shingle = '{}|{}|{}'.format(ordered[0], dist, ordered[1])

                if self.is_counted:
                    shingle_dict[shingle] += 1
                    shingle += '|' + str(shingle_dict[shingle])

                atom_pairs.append(shingle.encode('utf-8'))
        return list(set(atom_pairs))

Any plans to have map4 in pip?

First of all, very interesting paper; I'm looking forward to testing the performance of map4 myself. Are there any plans to have it on pip? That would make life a lot easier for those trying to use it and share it in a notebook (!pip install map4 --user is definitely simpler than having users create their own virtual env and make their notebooks aware of it). Also, your group's tmap conflicts with pip's tmap; are there any plans to address this?

Thanks!

Is it possible to draw a shingle on the molecular structure for a specific bit?

Thank you very much for sharing your great work.

I have a question regarding MAP4 as a bit vector.

When we have MAP4 as a bit vector, is it possible to identify which shingle each bit corresponds to and, in particular, to draw that shingle on the molecular structure?

(similar to the function rdkit.Chem.Draw.DrawMorganBit() in RDKit library)

isomericSmiles=False

First, thank you for your awesome work!

I'm wondering why you set isomericSmiles=False in map4.py at line 15.

From my perspective, stereochemistry information is quite important in molecular representation. In RDKit, there is an optional argument called useChirality when computing Morgan fingerprints.

So, does the code still work if I manually set isomericSmiles to True?

Test Code Below

import tmap as tm
from map4 import MAP4Calculator
from rdkit import Chem  # needed for Chem.MolFromSmiles below

MAP4 = MAP4Calculator(is_folded=True)

smiles_a = 'n1nn2N=C(C=C(c2n1)N)C(=O)Nc3c(cc(c(c3)C#C)F)C(=O)O[C@H]4[C@H](O)CSSC4'
mol_a = Chem.MolFromSmiles(smiles_a)
map4_a = MAP4.calculate(mol_a)

smiles_b = 'n1nn2N=C(C=C(c2n1)N)C(=O)Nc3c(cc(c(c3)C#C)F)C(=O)O[C@@H]4[C@H](O)CSSC4'  # '@' -> '@@'
mol_b = Chem.MolFromSmiles(smiles_b)
map4_b = MAP4.calculate(mol_b)

print(sum(map4_a == map4_b), MAP4.dimensions)

# [isomericSmiles=False output]: (1024, 1024)
# [isomericSmiles=True output]: (921, 1024)

If a molecule has an electric charge, some shingles have empty SMILES

Hi, thank you for sharing your great work.

I have tried MAP for a molecule with an electric charge, as below:

from rdkit import Chem
from map4 import MAP4Calculator

smile = '[I-].CC[N+](CC)(CC)c1ccccc1'
mol = Chem.MolFromSmiles(smile)
radius = 2
nBits = 2048

# set "is_folded=False" & "return_strings=True" to get shingles
MAP_c_s = MAP4Calculator(dimensions=nBits, radius=radius, is_folded=False, return_strings=True)
map_s = MAP_c_s.calculate(mol)
map_s

Some shingles contain an empty string, for example:

b'|100000000|c(c)(c)[N+]'

As can be seen, the left SMILES is an empty string and the topological distance is 100000000. This looks strange to me. Is it designed to behave like this, or is there no particular reason for it?

Thank you for taking the time to read my question.
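
The toy example below sketches how such a shingle could arise; it is an illustration, not the library's code. It assumes that the iodide is an isolated atom with no bonds (so it has no substructure environment and its SMILES is empty, as in the `return ''` branch of `_find_env` posted on this page) and that the distance matrix stores a large sentinel for atom pairs with no connecting path. The sentinel value is taken from the shingle shown above; the function names are hypothetical.

```python
from collections import deque

# Assumed sentinel for "no path between these atoms" (value taken from the shingle above).
UNREACHABLE = 100_000_000


def topological_distances(n_atoms, bonds):
    """Breadth-first-search bond distances; unreachable pairs keep the sentinel."""
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)

    dist = [[UNREACHABLE] * n_atoms for _ in range(n_atoms)]
    for start in range(n_atoms):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbour in adjacency[current]:
                if dist[start][neighbour] == UNREACHABLE:
                    dist[start][neighbour] = dist[start][current] + 1
                    queue.append(neighbour)
    return dist


def make_shingle(env_a, env_b, distance):
    """Same 'env|distance|env' layout as in the calculator code on this page."""
    first, second = sorted([env_a, env_b])
    return "{}|{}|{}".format(first, distance, second)


# Atom 0 stands in for the isolated ion; atoms 1 and 2 are bonded to each other.
distances = topological_distances(3, [(1, 2)])
print(make_shingle("", "c(c)(c)[N+]", distances[0][1]))
```

Under these assumptions, the empty left-hand side and the very large distance both follow from the counter-ion being a separate fragment rather than from the charge itself.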

List of shingles changes when I restart my PC

Hi, Thank you for sharing your great work.

I am trying to explain every set bit (value 1) in the MAP fingerprint bit vector, in order to explain an ML model trained on MAP fingerprints.

I am trying to find which shingle corresponds to a specific bit in the bit vector by reversing the modulo operation. I ran into a difficulty, which is explained below. I am sorry that my question is so long. Thank you for taking the time to read it.


I get the shingles for a specific molecule by setting "return_strings=True". It returns a list of shingles.

I noticed that after restarting my PC, I get the same list of shingles but in a different order.

For example, one time I get the shingles as [shingle A, shingle B, shingle C]. But the next day, I get [shingle C, shingle A, shingle B].

Also, when I apply "MHFPEncoder.hash" to get SHA-1 hash values for the shingles, I sometimes (not always) get different values after restarting my PC. For example, for shingle A I get a SHA-1 value of 962277111. Another day, I get a SHA-1 value of 3766998768 for the same shingle.

Interestingly, when I compute the bit vector (by setting is_folded = True), it is the same for a given molecule. There is no change in the bit vector.
The bit vector is obtained by the modulo operation below:

folded = np.zeros(nBits, dtype=np.uint8)
folded[hash_values % nBits] = 1 

Since the hash values change after restarting the PC, the remainders will change, and therefore different indexes in the "folded" vector will be set to 1.

My conclusion:
Every time I compute the bit vector, the bits with value "1" may refer to different shingles. For example, if the bit vector is:
[0 1 0 0 1 ......... 0 1]

The second bit has a value of 1. It may correspond to shingle A but, after restarting the PC, it may correspond to shingle B.
Could you please let me know whether my understanding is correct?

I guess the formula used to convert hash values into integers will sometimes (not always) produce different integer values for the same shingle after restarting the PC:

        hash_values = []

        for t in shingling:
            hash_values.append(struct.unpack("<I", sha1(t).digest()[:4])[0])

When I use "MHFPEncoder", its ".hash" method, and the above formula, I sometimes get different hash values for the same shingles.


My goal is to find which shingle corresponds to a specific bit. If the fingerprint representation changes when the PC or the Python kernel is restarted, it will not be possible to relate a specific bit to a specific shingle (or to several shingles, in the case of a bit collision).

I would really appreciate any ideas that could help me with this issue.
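
One possible explanation, offered as a hypothesis rather than a confirmed diagnosis: Python randomizes the built-in hash of bytes and str objects per process (unless PYTHONHASHSEED is set), so the iteration order of a set, and of the list(set(...)) used in the calculator code posted on this page, can differ between runs. The SHA-1 digest of a fixed byte string does not change between runs. If hash values are paired with shingles by position, a reordered list would then look like changing hashes. The sketch below uses only the standard library and the formula quoted above; it keys everything by the shingle itself and sorts first, so the bit-to-shingle mapping is reproducible. The function names are hypothetical.

```python
import struct
from collections import defaultdict
from hashlib import sha1


def shingle_hash(shingle):
    """First 4 bytes of the SHA-1 digest as a little-endian unsigned 32-bit integer."""
    return struct.unpack("<I", sha1(shingle).digest()[:4])[0]


def bit_to_shingles(shingles, n_bits):
    """Map each folded bit index to the shingle(s) that set it."""
    mapping = defaultdict(list)
    # Sorting removes any dependence on set iteration order.
    for shingle in sorted(set(shingles)):
        mapping[shingle_hash(shingle) % n_bits].append(shingle)
    return dict(mapping)


if __name__ == "__main__":
    example = [b"C|1|CC", b"CC|2|CO", b"C|3|CO"]  # made-up shingles for illustration
    for bit, members in sorted(bit_to_shingles(example, 2048).items()):
        # More than one shingle under the same bit is a collision.
        print(bit, members)
```

Because the folded vector depends only on which remainders occur, not on their order, an unchanged bit vector is consistent with this explanation.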

Dependency on Python 3.6 [suggestion]

Hi,

I remember that when I installed MAP, I had to use Python 3.6. Is there any plan to remove this dependency so that we can use newer Python versions?

I think this dependency comes from importing the tmap library. If tmap were replaced with the MinHash formula explained in the paper, there would be no need to import tmap and the dependency would go away. This is only a suggestion.
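
To make the suggestion concrete, here is a minimal standard-library MinHash sketch. It is a generic textbook construction (SHA-1 base hash followed by random affine permutations modulo a prime), not the tmap or MHFP implementation, so its values will not match fingerprints produced by the existing package. The constants, seed, and function names are assumptions chosen for this example.

```python
import random
import struct
from hashlib import sha1

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the affine permutations (assumed choice)
MAX_HASH = (1 << 32) - 1        # keep each value within 32 bits


def make_permutations(n, seed=42):
    """Generate n reproducible (a, b) pairs for h -> (a*h + b) mod p."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(n)
    ]


def minhash(shingles, permutations):
    """Return one minimum per permutation over the set of byte-string shingles."""
    base = [struct.unpack("<I", sha1(s).digest()[:4])[0] for s in set(shingles)]
    if not base:
        raise ValueError("at least one shingle is required")
    return [
        min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in base)
        for a, b in permutations
    ]


if __name__ == "__main__":
    perms = make_permutations(8)
    print(minhash([b"C|1|CC", b"CC|2|CO"], perms))  # made-up shingles
```

The same permutations must be reused for every molecule, otherwise the resulting vectors are not comparable.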

How to use MAP similarity search for databases other than ChEMBL

Hello,

Thanks for sharing your great work.
I was wondering whether it is possible to use a custom-built library for similarity search. When running the fingerprint calculation from the terminal, how can I point to another library (map4.py -i smilesfile.smi -o outputfile --some other library)?

Can you please provide an example of how to do this?

Thanks,
Best,
Amir

can dim be larger than 2048?

Somewhere you mentioned that the embedding dimension can be (128, 256, 512, 1024, 2048). Can I use a larger number, such as 8192? Thanks

Clustering

Dear Madam/Sir,

Thank you very much for putting together this repo and for explaining how to use it.

I have around 1000 fingerprints calculated and I was wondering whether there is any direct way to cluster them based on a similarity threshold.
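
One simple option for about 1000 fingerprints is threshold-based single-linkage clustering: link every pair whose distance is at or below a cutoff and take the connected components. The sketch below is a generic standard-library illustration, not a feature of this repository. It assumes unfolded MinHash fingerprints given as equal-length sequences of integers and estimates the distance as the fraction of positions that differ; both function names are hypothetical.

```python
def minhash_distance(fp_a, fp_b):
    """Fraction of positions at which two equal-length MinHash vectors differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    mismatches = sum(1 for a, b in zip(fp_a, fp_b) if a != b)
    return mismatches / len(fp_a)


def threshold_clusters(fingerprints, max_distance):
    """Single-linkage clusters (lists of indices) using a union-find structure."""
    n = len(fingerprints)
    parent = list(range(n))

    def find(i):
        # Path halving keeps the trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if minhash_distance(fingerprints[i], fingerprints[j]) <= max_distance:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    toy = [[1, 2, 3, 4], [1, 2, 3, 5], [9, 9, 9, 9], [9, 9, 9, 8]]
    print(threshold_clusters(toy, 0.25))
```

Single linkage can chain loosely related molecules together, so a stricter cutoff or a different linkage may suit some datasets better.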

Problem with installation

Hello,

Thank you for your very interesting work! I am excited to give these fingerprint calculations a try, yet I have not succeeded with the installation so far. My knowledge of Python is quite limited, but it seems that some packages are incompatible with others and with the Python version. I've tried Python versions 3.9.7 and 3.6.5. I saw that some people had problems with the "tmap" package, and I also tried installing it separately, with no success. Should I use a specific version of Python for the MAP4 installation?

Best,
Helen

Facing an issue while running MAP4

Hi, I am very interested in trying the MAP4 fingerprint, but I am facing a problem while running it in a Jupyter notebook. It shows an error at "tm.Minhash(dimensions)" when I try to run MAP4. At first I thought it was a problem with tmap, but there seems to be something else. Please help me.
Any suggestion would be helpful.

Relationship between distance & similarity?

I would like to confirm the relationship between the distance and the similarity of molecules. When we calculate the distance as below:

ENC = tm.Minhash(dim)
ENC.get_distance(map4_a, map4_b)

I guess the similarity will be:

similarity = 1 - distance

I would really appreciate it if you could confirm the similarity calculation.
Thank you.
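
For MinHash fingerprints in general, the usual estimate of Jaccard similarity is the fraction of positions at which two vectors agree, and the corresponding distance is one minus that fraction. The sketch below illustrates this relationship with plain Python lists. It assumes that get_distance returns such an estimated Jaccard distance, which should be checked against the tmap documentation, and it does not apply to folded (binary) fingerprints. The function names are hypothetical.

```python
def estimated_jaccard_similarity(fp_a, fp_b):
    """Fraction of matching positions between two equal-length MinHash vectors."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    matches = sum(1 for a, b in zip(fp_a, fp_b) if a == b)
    return matches / len(fp_a)


def estimated_jaccard_distance(fp_a, fp_b):
    """Distance defined as the complement of the estimated similarity."""
    return 1.0 - estimated_jaccard_similarity(fp_a, fp_b)


if __name__ == "__main__":
    a = [11, 22, 33, 44]  # made-up MinHash values
    b = [11, 22, 0, 0]
    print(estimated_jaccard_similarity(a, b), estimated_jaccard_distance(a, b))
```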

Allow a range of Python versions

Hi,

I'm not deeply familiar with conda, but would it be possible to allow the library to be installed on a wider range of Python versions?

Looking at the package match specification documentation, I think this could be achieved by modifying this line in the environment.yml file. (Please correct me if I'm wrong.)

- python=3.6

To something like the following

 - python>=3.6,<=3.9

or with a space

 - python >=3.6,<=3.9

Thanks!

How to preprocess MAP4 before training?

Hello,

I tried today to train a simple MLP classifier to predict the bioactivity of a set of small molecules (not peptidomimetics or peptides). This is how I preprocess the MAP4 vectors before training:

x = np.array(MAP4.calculate_many(mol_list), dtype=np.int)
ColorPrint("Scaling features in the range [0,1].", "OKBLUE")
scaler = MinMax_Scaler()  # scaler with memory, to be used later on the xtest
x = scaler.fit_transform(x)
ColorPrint("Removing only uniform features.", "OKBLUE")
x = remove_uniform_columns(x)

The 2D array x is the input to the MLP. Oddly enough, the performance of the MLP in 5-fold cross-validation is poorer than with any other fingerprint that I tested. See the results below:

Results for feature vector type ECFPL: average AUC-ROC=0.754875+-0.061359 average DOR=16.044811+-13.862254 average MK=0.509749+-0.122718
Results for feature vector type FCFPL: average AUC-ROC=0.754583+-0.067879 average DOR=17.411853+-14.740475 average MK=0.509166+-0.135758
Results for feature vector type AvalonFPL: average AUC-ROC=0.755908+-0.072506 average DOR=22.775238+-29.836969 average MK=0.511817+-0.145013
Results for feature vector type gCSFP: average AUC-ROC=0.733945+-0.072095 average DOR=14.667879+-16.226123 average MK=0.467889+-0.144191
Results for feature vector type CSFPL: average AUC-ROC=0.716520+-0.053522 average DOR=8.701626+-4.768975 average MK=0.433040+-0.107043
Results for feature vector type tCSFPL: average AUC-ROC=0.719829+-0.058146 average DOR=8.996485+-5.867861 average MK=0.439658+-0.116293
Results for feature vector type iCSFPL: average AUC-ROC=0.768058+-0.091553 average DOR=58.927698+-100.427486 average MK=0.536115+-0.183106
Results for feature vector type fCSFPL: average AUC-ROC=0.741297+-0.062825 average DOR=13.043251+-8.988206 average MK=0.482595+-0.125651
Results for feature vector type pCSFPL: average AUC-ROC=0.749496+-0.063083 average DOR=14.137692+-10.776114 average MK=0.498991+-0.126165
Results for feature vector type gCSFPL: average AUC-ROC=0.738966+-0.083883 average DOR=17.951455+-21.426471 average MK=0.477931+-0.167767
Results for feature vector type AP: average AUC-ROC=0.779686+-0.067921 average DOR=23.284337+-19.439457 average MK=0.559371+-0.135841
Results for feature vector type cAP: average AUC-ROC=0.785569+-0.056231 average DOR=21.165699+-13.829002 average MK=0.571139+-0.112462
Results for feature vector type TT: average AUC-ROC=0.728741+-0.093040 average DOR=26.068889+-41.656358 average MK=0.457481+-0.186081
Results for feature vector type cTT: average AUC-ROC=0.725244+-0.087174 average DOR=21.606984+-33.135657 average MK=0.450488+-0.174347
Results for feature vector type ErgFP: average AUC-ROC=0.722143+-0.042805 average DOR=7.991111+-3.364458 average MK=0.444285+-0.085610
Results for feature vector type 2Dpp: average AUC-ROC=0.754694+-0.036142 average DOR=13.907273+-8.282393 average MK=0.509388+-0.072285
Results for feature vector type MAP4: average AUC-ROC=0.713780+-0.055750 average DOR=8.804872+-4.667274 average MK=0.427561+-0.111501

Am I doing something wrong in the preparation of the MAP4 feature vectors? Is this the right way to train a network using MAP4 as input? I am asking because I read in the documentation that, due to MinHashing, the order of the features matters and the distance cannot be calculated "feature-wise". I wonder whether this property also affects the neural network's training.

Thanks in advance.
Thomas
