kinoml's People

Contributors

andreavolkamer, corey-taylor, glass-w, jaimergp, jchodera, jiayeguo, raquellrios, schallerdavid, t-kimber


kinoml's Issues

Multiprocessing ligand attributes lost

Currently, I am experimenting with providing a separate docking template to the OEPositDockingFeaturizer in the posit_template branch. This way one could dock into a structure from PDB entry 4aoj but bias the Posit docking algorithm with the ligand co-crystallized in another PDB entry, e.g. 4yne.

I pass the required information as attributes to the corresponding ligand and protein instances of the protein-ligand complex (see code example below). The Featurizer should be able to read these attributes to do a proper job. This always worked fine in other Featurizers when passing attributes to the protein only. However, adding additional attributes to the ligand instance gives surprising results when using multiprocessing: everything except the smiles and name attributes (given during initialization) is lost.

from kinoml.core.components import BaseProtein
from kinoml.core.ligands import Ligand
from kinoml.core.systems import ProteinLigandComplex
from kinoml.features.complexes import OEPositDockingFeaturizer

compounds = {
    "larotrectinib": "C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F",
    "selitrectinib": "CC1CCC2=C(C=C(C=N2)F)C3CCCN3C4=NC5=C(C=NN5C=C4)C(=O)N1"
}

systems = []
for name, smiles in compounds.items():
    protein = BaseProtein(name="NTRK1")
    protein.pdb_id = "4aoj"
    protein.expo_id = "V4Z"
    protein.chain_id = "A"
    ligand = Ligand.from_smiles(smiles=smiles, name=name)
    ligand.docking_template_pdb_id = "4yne"  # lost in multiprocessing
    ligand.docking_template_expo_id = "4EK"  # lost in multiprocessing
    ligand.docking_template_chain_id = "A"  # lost in multiprocessing
    systems.append(ProteinLigandComplex(components=[protein, ligand]))

featurizer = OEPositDockingFeaturizer(output_dir="posit", use_multiprocessing=True)

systems = featurizer.featurize(systems)

Some quick googling gave me a few hints. It looks like there may be a serialization (pickling) problem: attributes that are not restored during unpickling get silently dropped when objects are sent to worker processes.

Interestingly, this is not a problem when using the RDKitLigand class instead of the Ligand class to store the attributes. Since the Ligand class is based on the _OpenForceFieldMolecule class, the problem may arise on their end.
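For illustration, here is a minimal, hypothetical sketch of the suspected mechanism (not KinoML code): if a class customizes pickling so that only its constructor arguments survive, any attribute added after construction is silently dropped when the object crosses a process boundary.

import pickle

class Molecule:
    def __init__(self, smiles, name):
        self.smiles = smiles
        self.name = name

    def __reduce__(self):
        # only the constructor arguments survive pickling; attributes
        # added later (e.g. docking_template_pdb_id) are lost
        return (self.__class__, (self.smiles, self.name))

ligand = Molecule("CCO", "ethanol")
ligand.docking_template_pdb_id = "4yne"
roundtripped = pickle.loads(pickle.dumps(ligand))
print(hasattr(roundtripped, "docking_template_pdb_id"))  # False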

Memory usage during featurization

The recent changes in the featurization pipeline changed how the featurizers go through the different systems in a dataset.

Previously, a single system would go all the way through the pipeline (composed of N featurizers). This made it difficult to auto-detect global properties of the dataset (e.g. maximum length to pad bit-vectors to), so we refactored the pipeline so it traverses featurizers first.

# before: each system went all the way through the pipeline
for system in systems:
    for featurizer in featurizers:
        featurizer.featurize(system)

# now: each featurizer traverses the whole dataset
for featurizer in featurizers:
    featurizer.featurize(systems)

This, however, implies that ALL the artifacts created by each featurizer coexist in time for the full dataset, i.e. higher memory usage. To give some numbers, ChEMBL28 (158K systems) peaks at around 6 GB of RAM, mainly due to all the RDKit molecules created from the SMILES. We do clear the featurizations dictionary after each pass by default (a recent change), but I am writing this down as an issue because it might become a bottleneck for more complex schemes. These might require featurizing datasets in batches, as sketched below.
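A minimal batching sketch (featurize_in_batches is a hypothetical helper, not part of KinoML): process the dataset in chunks so the per-featurizer artifacts for the full dataset never coexist in memory.

def featurize_in_batches(featurizers, systems, batch_size=10_000):
    featurized = []
    for start in range(0, len(systems), batch_size):
        batch = systems[start:start + batch_size]
        for featurizer in featurizers:
            # assumes featurize() returns the featurized systems,
            # as in the examples above
            batch = featurizer.featurize(batch)
        featurized.extend(batch)
    return featurized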

Ligand-based progress tracking

This issue will track all the progress related to getting the core functionality for ligand-based models merged to master.

Core functionality

  • DatasetProvider exports to PyTorch (see #8)
  • MeasurementType implements accurate observation models (see #8)
  • System is composed of typed MolecularComponents: Protein, Ligand, Complexes (see #8)
  • Featurizers can operate in batch mode or upon dataset access via DataLoader (see #8)
  • Filter invalid records (see note in #8 description)
  • Cross validation

Standard practices

  • Ensure documentation is accurate and up-to-date
    • Drop mkdocs for Sphinx + material theme?
  • Unit tests have good coverage and are passing

Consolidate previous PRs

  • Bring scripts from #6
  • Bring observation model notebook from #14

Pending scientific questions

  • Unknowns. These could be nuisance parameters in the future, but for now:
    • [substrate]: we could estimate by cross-validation, use relative pIC50s, or add that as a nuisance parameter in the future
    • uncertainties: estimate per dataset (as in Kramer's paper) or per measurement class
  • Verify % displacement equations. See discussion

OEDocking API deprecated

The OEDocking functionality currently relies on deprecated receptor objects. We should move to design units soon.

dead kernel for example notebooks without openeye license

Using OpenEye code without a license leads to dead kernels. It would be great if the example notebooks could also be run without a valid OpenEye license, or at least display a warning instead of a dying kernel without any further information.

  • getting_started.ipynb - cell 5 -> no dead kernel, but does not work without toolkit="MDAnalysis"
  • kinoml_object_model - cell 7 -> dead kernel, which will likely also happen in later cells
  • OpenEye_structural_featurizer -> I guess failing is fine here, but a warning would still be nice
  • Schrodinger_structural_featurizer -> reports the missing SCHRODINGER installation at the appropriate level

I added an OpenEye license check to the Protein object initialization. However, this check is not triggered when using the from_pdb and from_file methods. So one could add the checks there as well, or find a more general solution.

In any case, if you stumble over dead kernels and don't have a valid OpenEye license, specify toolkit="MDAnalysis" when working with Protein objects and DataSetProviders.
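For example, a hedged sketch of the workaround (the module path and exact signature are assumptions based on this issue's description):

from kinoml.core.proteins import Protein  # module path assumed

# fall back to the MDAnalysis toolkit so no OpenEye license is needed
protein = Protein.from_pdb("4aoj", toolkit="MDAnalysis")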

KinoML workflow

This issue concerns the general workflow of KinoML models.

#14 deals with the theory behind the modular likelihood function.

  • a notebook example is available with the data set included (ChEMBL v25 as a toy example).

#13 deals with the API implementation.

Notebook tests with cell execution times

It would be great if our CI could provide information about cell execution times when testing the example notebooks. We use the nbval plugin for pytest, which currently does not support this feature.
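A possible stopgap (untested assumption: since nbval collects each notebook cell as a separate pytest test, pytest's generic duration reporting should apply per cell; the notebook path below is illustrative):

    pytest --nbval examples/getting_started.ipynb --durations=0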

Revisit caching strategy

Our current cache implementations rely on functools.lru_cache decorators, which accept a maxsize argument with the number of items to memoize. We could make this (currently hardcoded) value configurable if we drop the decorator syntactic sugar and decorate the memoized methods "manually" in __init__.
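A minimal sketch of the manual approach (class and method names are illustrative, not the actual KinoML classes):

from functools import lru_cache

class CachedFeaturizer:
    def __init__(self, cache_size=128):
        # wrap the method with a per-instance cache of configurable size;
        # the instance attribute shadows the undecorated class method
        self._featurize_one = lru_cache(maxsize=cache_size)(self._featurize_one)

    def _featurize_one(self, system):
        # expensive computation; note lru_cache requires hashable arguments
        ...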

Refactor example notebooks

The example notebooks are quite out of date. We can integrate them in CI testing (PR #66).

  • think about application scenarios
  • remove unimportant notebooks
  • investigate notebook testing for CodeCov
  • later, we can integrate tests into the docstrings

Parquet metadata

The Parquet metadata still needs to include the measurement type, external indices and, if needed, provenance information for each kinase / ligand.
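A hedged sketch of how such metadata could be attached with pyarrow (keys and values below are illustrative, not a fixed KinoML schema):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"kinase": ["NTRK1"], "ligand": ["larotrectinib"], "pIC50": [8.2]})
# attach key/value metadata to the schema before writing
table = table.replace_schema_metadata({
    "measurement_type": "pIC50",          # illustrative
    "external_index": "ChEMBL28",         # illustrative
    "provenance": "kinoml featurization", # illustrative
})
pq.write_table(table, "dataset.parquet")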

Travis CI Security Breach Notice

MolSSI is reaching out to every repository created from the MolSSI Cookiecutter-CMS with a .travis.yml file present to alert them to a potential security breach in using the Travis-CI service.

Between September 3 and September 10, 2021, the secure environment variables Travis-CI uses were leaked for ALL projects and injected into the publicly available runtime logs. See more details here. All Travis-CI users should cycle any secure variables/files and associated objects as soon as possible. We are reaching out to our users as good stewards of the third-party products we recommended, which might still be in use, to provide a duty-to-warn given the potential severity of the breach.

We at MolSSI recommend moving away from Travis-CI to another CI provider as soon as possible. Given the nature of this breach and the way Travis-CI mishandled the response, MolSSI cannot recommend the Travis-CI platform for any reason at this time. We suggest either GitHub Actions (as used from v1.5 of the Cookiecutter-CMS) or another service offered on GitHub.

If you have already addressed this security concern or it does not apply to you, feel free to close this issue.

This issue was created programmatically to reach as many potential end-users as possible. We do apologize if this was sent in error.

Check lint warnings

Currently, we run pylint with --disable=W because of E1102 (not-callable) and E0401 (import-error) errors. We should tackle those in the future.

Add CI badge to README

We should add a CI badge so we can easily see and access the state of the nightly CI tests.

DataSetProvider multiprocessing

Currently, DataSetProviders cannot control multiprocessing of given Featurizers.

  • multiprocessing is controlled at the Featurizer.featurize() level
  • need to change DataSetProvider.featurize() method to pass multiprocessing parameters to each Featurizer.featurize() call
  • usually, DataSetProviders also call PipelineFeaturizer and ConcatenatedFeaturizer; those need to be able to switch between multiprocessing and single-process execution, depending on whether the specific Featurizer inherits from BaseFeaturizer or ParallelBaseFeaturizer (see the sketch below)
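A hedged sketch of the proposed pass-through (keyword names and the import path are assumptions, not the actual KinoML API):

from kinoml.features.core import ParallelBaseFeaturizer  # path assumed

class DataSetProvider:
    def featurize(self, systems, use_multiprocessing=False, n_processes=None):
        for featurizer in self.featurizers:
            if isinstance(featurizer, ParallelBaseFeaturizer):
                # forward the multiprocessing parameters
                systems = featurizer.featurize(
                    systems,
                    use_multiprocessing=use_multiprocessing,
                    n_processes=n_processes,
                )
            else:
                # plain BaseFeaturizer: single-process only
                systems = featurizer.featurize(systems)
        return systems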

PyTorch & Python 3.9 won't work together

We have disabled Python 3.9 testing because:

  • On Linux, it won't install properly
  • On macOS, it will install, but the kernel dies immediately when you try to import kinoml

We are using PyPI builds for PyTorch because conda-forge's won't work with torch-geometric. Until torch-geometric is on conda-forge, there is no way around this!

kinoml installation

  1. Create conda environment using the yaml file
    conda env create --file devtools/conda-envs/test_env.yaml
  2. Activate conda environment
    conda activate test
  3. Install kinoml
    pip install -e . --no-deps

If, while running the example notebook (see kinoml/examples/ChEMBL.ipynb), the following error occurs:

    AttributeError: 'Molecule' object has no attribute 'components'

execute in a terminal:

    pip install https://github.com/openforcefield/openforcefield/archive/toolkit-inheritance.zip --no-deps

Pending issues from `api` PR

  • Structural object model based on MDA
  • Docking
    • Librarize the code that needs it
    • Ensure / add tests
    • Ensure / add docs
  • Homology modeling
    • Librarize the code that needs it
    • Ensure / add tests
    • Ensure / add docs
  • Dunbrack featurizer
    • Librarize the code that needs it
    • Ensure / add tests
    • Ensure / add docs

missing sidechains

This issue is a reminder that OESpruce will not always build all missing sidechains, especially if every rotamer in its rotamer library would clash with the already existing protein atoms. A possible solution would be to integrate PDBFixer for those situations, since it does not care about atom clashes at all.
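A hedged sketch of a PDBFixer fallback for such structures (file paths are illustrative):

from pdbfixer import PDBFixer
from openmm.app import PDBFile

fixer = PDBFixer(filename="protein_with_missing_sidechains.pdb")
fixer.findMissingResidues()
fixer.findMissingAtoms()   # includes missing sidechain heavy atoms
fixer.addMissingAtoms()    # placed without checking for clashes
with open("protein_fixed.pdb", "w") as outfile:
    PDBFile.writeFile(fixer.topology, fixer.positions, outfile)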

Code coverage

Update tests, ideally reaching 80% coverage.

Note: add a check for the estimate input of the forward method in PyTorch models, see #53

ImportError/GLIBC issue using Openeye 2023.2.3

Noting this version-related issue for posterity. Not necessarily an issue now, but it could be in the future:

Running anything related to OESpruce in the following manner (as an example):

abl1_systems = abl1_featurizer.featurize(abl1_systems)

causes an ImportError in Python with the following stack trace (I'm only pasting the bottom of the trace due to its length):

File ~/miniconda3/envs/kinoml/lib/python3.9/site-packages/openeye/libs/__init__.py:109, in OEGetModule(name)
    106 spec = importlib.machinery.PathFinder().find_spec(name, [OPENEYE_DLLS])
    108 # actually load the module
--> 109 mod = importlib.util.module_from_spec(spec)
    110 spec.loader.exec_module(mod)
    112 return mod

ImportError: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by /home/lauj2/miniconda3/envs/kinoml/lib/python3.9/site-packages/openeye/libs/python3-linux-x64-g++10.x/liboespruce-1.5.3.3.so)

It looks like the latest versions of openeye-toolkits require GLIBC >= 2.28. Currently, our GLIBC on Lilac is version 2.17.

This is resolved by installing openeye-toolkits versions 2021.2.0 up to 2023.1.1.

It may be worth pinning the openeye-toolkits version in the YAML installation file for now.
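A hedged example of such a pin in devtools/conda-envs/test_env.yaml (assuming openeye-toolkits is listed there under dependencies):

    - openeye-toolkits >=2021.2.0,<=2023.1.1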

Memory performance

Converting our DatasetProviders to native torch.Dataset objects involves creating new torch.Tensors that can easily exceed available memory.

There are some strategies we can investigate to robustly minimize this issue, but they mainly involve minibatches and compatible mechanisms. This requires DataLoader adapters, which would convert to tensors only on __getitem__ access.
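A minimal sketch of such an adapter (the provider access pattern below is an assumption):

import torch
from torch.utils.data import Dataset

class LazyProviderDataset(Dataset):
    """Convert provider records to tensors only on __getitem__ access."""

    def __init__(self, provider):
        self.provider = provider  # hypothetical DatasetProvider

    def __len__(self):
        return len(self.provider)

    def __getitem__(self, index):
        features, measurement = self.provider[index]  # assumed access pattern
        return torch.as_tensor(features), torch.as_tensor(measurement)

# minibatches are then materialized one at a time, e.g.:
# loader = torch.utils.data.DataLoader(LazyProviderDataset(provider), batch_size=64)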

Structure-oriented object model

Let's start thinking of the requirements we will have. From our previous meeting on early July, we have these action items:

CC @schallerdavid @WG150

  • JRG/DaS/WG @ Fri
    • API foundations for structural objects
    • Start with MDAnalysis and implement idea tree in meeting notes
      • Investigate the design conversion as we go
  • JRG/AV/DaS @ in two weeks
    • Sequence renumbering discussion
  • JG/DaS/AV @ some point
    • Dunbrack conformations
  • WG/DaS/tbd @ some point
    • Loop modelling
    • Homology modelling

We will probably need a lazy featurization scheme like in the ChEMBL notebook, either in real time, or by pre-featurizing everything offline and then accessing the providers to get data from disk on demand.

IO handling

How should we handle reading and writing of, e.g., molecules? Would it be good to have an io subpackage? Also, loop modeling and system preparation with the OpenEye toolkit will benefit from reading the original PDB file, since OpenEye uses information stored in the HEADER section. So we would need an OpenEye-specific io module.
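A hedged sketch of what an OpenEye-specific io module could look like (the module layout and function name are assumptions):

# kinoml/io/openeye.py (hypothetical layout)
from openeye import oechem

def read_molecules(path):
    """Yield all molecules from a file using the OpenEye toolkit."""
    ifs = oechem.oemolistream()
    if not ifs.open(str(path)):
        raise IOError(f"Could not open {path}")
    for mol in ifs.GetOEGraphMols():
        # copy each molecule; the stream reuses its internal buffer
        yield oechem.OEGraphMol(mol)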
