openkinome / kinoml
Structure-informed machine learning for kinase modeling
Home Page: https://openkinome.org/kinoml/
License: MIT License
Have class-based documentation with hierarchy in the Sphinx docs. See this example:
https://open-forcefield-toolkit.readthedocs.io/en/0.10.0/topology.html
Source:
https://open-forcefield-toolkit.readthedocs.io/en/0.10.0/_sources/topology.md.txt
Currently, I am experimenting with providing a separate docking template to the OEPositDockingFeaturizer in the posit_template branch. This way, you could dock into a structure from PDB entry 4aoj but bias the Posit docking algorithm with the ligand co-crystallized in another PDB entry, e.g. 4yne.
I pass the required information as attributes to the corresponding ligand and protein instances of the protein-ligand complex (see code example below). The featurizer should be able to read these attributes to do a proper job. This process always worked fine when passing the attributes to the protein only in other featurizers. However, adding additional attributes to the ligand instance gives surprising results when using multiprocessing: anything other than the smiles and name attributes (given during initialization) is lost.
from kinoml.core.components import BaseProtein
from kinoml.core.ligands import Ligand
from kinoml.core.systems import ProteinLigandComplex
from kinoml.features.complexes import OEPositDockingFeaturizer

compounds = {
    "larotrectinib": "C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F",
    "selitrectinib": "CC1CCC2=C(C=C(C=N2)F)C3CCCN3C4=NC5=C(C=NN5C=C4)C(=O)N1",
}

systems = []
for name, smiles in compounds.items():
    protein = BaseProtein(name="NTRK1")
    protein.pdb_id = "4aoj"
    protein.expo_id = "V4Z"
    protein.chain_id = "A"
    ligand = Ligand.from_smiles(smiles=smiles, name=name)
    ligand.docking_template_pdb_id = "4yne"  # lost in multiprocessing
    ligand.docking_template_expo_id = "4EK"  # lost in multiprocessing
    ligand.docking_template_chain_id = "A"  # lost in multiprocessing
    systems.append(ProteinLigandComplex(components=[protein, ligand]))

featurizer = OEPositDockingFeaturizer(output_dir="posit", use_multiprocessing=True)
systems = featurizer.featurize(systems)
Just googling this behavior gave me a few hints. It looks like there may be a serialization problem. Interestingly, this is not a problem when using the RDKitLigand class instead of the Ligand class to store the attributes. Since the Ligand class is based on the _OpenForceFieldMolecule class, the problem may arise on their end.
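This would be consistent with how multiprocessing pickles objects to send them between processes: if a class customizes its serialization (e.g. via __reduce__) to rebuild instances from their constructor arguments only, any dynamically added attribute is silently dropped in the round-trip. A minimal, self-contained sketch of that failure mode (the toy Ligand class below is illustrative, not the real KinoML class):

```python
import pickle


class Ligand:
    """Toy stand-in for a molecule class whose custom pickling only
    round-trips the constructor arguments (illustrative, not KinoML)."""

    def __init__(self, smiles, name):
        self.smiles = smiles
        self.name = name

    def __reduce__(self):
        # only smiles and name survive pickling;
        # dynamically added attributes are dropped
        return (Ligand, (self.smiles, self.name))


ligand = Ligand("CCO", "ethanol")
ligand.docking_template_pdb_id = "4yne"  # dynamically added attribute

# a pickle round-trip, which is what multiprocessing does under the hood
clone = pickle.loads(pickle.dumps(ligand))
# clone.smiles is intact, but clone has no docking_template_pdb_id
```

If _OpenForceFieldMolecule implements something along these lines, that would explain why only smiles and name survive the worker processes.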
The recent changes in the featurization pipeline changed how the featurizers go through the different systems in a dataset.
Previously, a single system would go all the way through the pipeline (composed of N featurizers). This made it difficult to auto-detect global properties of the dataset (e.g. maximum length to pad bit-vectors to), so we refactored the pipeline so it traverses featurizers first.
# before
for system in systems:
    for featurizer in featurizers:
        featurizer.featurize(system)

# now
for featurizer in featurizers:
    featurizer.featurize(systems)
This, however, implies that ALL the artifacts created by each featurizer coexist in time for the full dataset, i.e. more memory usage. To give some numbers, ChEMBL28 (158K systems) peaks at around 6 GB of RAM, mainly due to all the RDKit molecules that will be created from the SMILES. We do clear the featurizations dictionary after each pass by default (a recent change), but I am writing this down as an issue because it might become a bottleneck for more complex schemes. These might require featurizing datasets in batches.
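A minimal sketch of what batched featurization could look like, so that only one batch's artifacts live in memory at a time (the `batched` helper and batch size are hypothetical, not part of the KinoML API):

```python
def batched(systems, batch_size):
    """Yield successive slices of the systems list.

    Hypothetical helper: featurizing batch by batch bounds peak memory
    to the artifacts of a single batch instead of the whole dataset.
    """
    for start in range(0, len(systems), batch_size):
        yield systems[start:start + batch_size]


# sketch of usage, assuming a featurizer as in the snippet above:
# for batch in batched(systems, 10_000):
#     featurizer.featurize(batch)
```

The trade-off is that dataset-global properties (e.g. maximum bit-vector length) would again need a separate pass or a prior estimate.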
This issue will track all the progress related to getting the core functionality for ligand-based models merged to master.
mkdocs for Sphinx + material theme?

[substrate]: we could estimate by cross-validation, use relative pIC50s, or add that as a nuisance parameter in the future.

The OEDocking functionality is working with deprecated receptor objects. We should soon move to design units.
Using OpenEye code without a license leads to dead kernels. It would be great if the example notebooks could also be run without a valid OpenEye license, or at least display a warning instead of a dying kernel without any further information.
I added an OpenEye license check in the Protein object initialization. However, this does not get used by the from_pdb and from_file methods. So one could add the checks there as well or find a more general solution.
In any case, if you stumble over dead kernels and don't have a valid OpenEye license, specify toolkit="MDAnalysis" when working with Protein objects and DataSetProviders.
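One possible shape for a more general solution is a single helper that both from_pdb and from_file could call before touching any OpenEye code. The helper name below is an assumption; OEChemIsLicensed is the actual OpenEye call:

```python
def openeye_is_licensed():
    """Return True only if the OpenEye toolkits import cleanly AND a
    valid license is found; never raises.

    Sketch of a hypothetical helper (not the KinoML API). The real
    OpenEye call is oechem.OEChemIsLicensed().
    """
    try:
        from openeye import oechem
    except ImportError:
        # toolkits not installed at all
        return False
    return bool(oechem.OEChemIsLicensed())
```

Callers could then warn and fall back to toolkit="MDAnalysis" instead of letting the kernel die.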
Some resources: sklearn / xgboost and convert to PyTorch: https://towardsdatascience.com/transform-your-ml-model-to-pytorch-with-hummingbird-da49665497e7 (https://github.com/microsoft/hummingbird)

In kinoml.features.core we implement a few abstract classes. According to Python best practices, we should include the functionality from the built-in abc library. However, this may also complicate testing methods defined in abstract classes.
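A sketch of what that could look like, including how a tiny concrete subclass keeps the shared logic testable (class names are illustrative, not the actual kinoml.features.core classes):

```python
from abc import ABC, abstractmethod


class BaseFeaturizer(ABC):
    """Illustrative abstract featurizer using the built-in abc machinery."""

    @abstractmethod
    def _featurize_one(self, system):
        """Subclasses must implement per-system featurization."""

    def featurize(self, systems):
        # concrete logic defined on the abstract class; abc makes the
        # class un-instantiable, so tests need a minimal subclass
        return [self._featurize_one(system) for system in systems]


class IdentityFeaturizer(BaseFeaturizer):
    """Minimal concrete subclass, handy for testing the shared logic."""

    def _featurize_one(self, system):
        return system
```

With abc, instantiating BaseFeaturizer directly raises TypeError, which is exactly the testing complication mentioned above: shared methods can only be exercised through a throwaway subclass like IdentityFeaturizer.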
It would be great if our CI can provide information about cell execution times when testing the example notebooks. We use the nbval plugin for pytest, which does not support this feature currently.
Our current cache implementations rely on functools.lru_cache decorators, which accept a maxsize keyword with the number of items to memoize. We can customize this (now hardcoded) value if we drop the decorator syntax sugar and decorate the memoized methods in __init__ "manually".
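A sketch of that "manual" decoration pattern, assuming a hypothetical featurizer class (not the actual KinoML API):

```python
from functools import lru_cache


class CachedFeaturizer:
    """Illustrative: decorate in __init__ so the cache size is configurable
    per instance instead of hardcoded at class-definition time."""

    def __init__(self, cache_size=128):
        # wrap the plain bound method with an lru_cache of the requested
        # size; each instance gets its own independent cache
        self.featurize_one = lru_cache(maxsize=cache_size)(self._featurize_one)

    def _featurize_one(self, smiles):
        # placeholder for the expensive work being memoized
        return len(smiles)
```

A side benefit: a per-instance cache avoids the well-known pitfall of lru_cache on methods keeping instances alive through the class-level cache.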
The example notebooks are quite out of date. We can integrate them in CI testing (PR #66).
Still needs to include the measurement type, external indices and, if needed, provenance information for each kinase / ligand.
Updating Biotite to 0.34.0 fixes the issue; mainly relevant for users with existing installs of the kinoml environment.
MolSSI is reaching out to every repository created from the MolSSI Cookiecutter-CMS with a .travis.yml file present to alert them to a potential security breach in using the Travis-CI service.
Between September 3 and September 10, 2021, the Secure Environment Variables Travis-CI uses were leaked for ALL projects and injected into the publicly available runtime logs. See more details here. All Travis-CI users should cycle any secure variables/files and associated objects as soon as possible. We are reaching out to our users as good stewards of the third-party products we recommended and that might still be in use, to provide a duty-to-warn to our end-users given the potential severity of the breach.
We at MolSSI recommend moving away from Travis-CI to another CI provider as soon as possible. Given the nature of this breach and the way the response was mishandled by Travis-CI, MolSSI cannot recommend the Travis-CI platform for any reason at this time. We suggest either GitHub Actions (as is used from v1.5 of the Cookiecutter-CMS) or some other service offered on GitHub.
If you have already addressed this security concern or it does not apply to you, feel free to close this issue.
This issue was created programmatically to reach as many potential end-users as possible. We do apologize if this was sent in error.
Currently, we run pylint with --disable=W because of E1102(not-callable) and E0401(import-error) errors. We should tackle those in the future.
We should add a CI badge so we can easily see and access the state of the nightly CI tests.
Currently, DataSetProviders cannot control multiprocessing of given Featurizers.
We have disabled Python 3.9 testing because:
import kinoml
We are using PyPI builds for PyTorch because conda-forge's won't work with torch-geometric. Until Geometric is on CF, there's no other way around!
We should include the data set which contains over 17k measurements of compounds on the CDK2 kinase.
It may be useful to include kincore (http://dunbrack.fccc.edu/kincore/) as a data provider and even use that to add the data as metadata for the Protein objects.
docking.OEDocking and features.complexes have not been tested yet
Add structure-based object model based on MDAnalysis to replace PDB artifacts.
conda env create --file devtools/conda-envs/test_env.yaml
conda activate test
pip install -e . --no-deps
If, while running the example notebook (see kinoml/examples/ChEMBL.ipynb), the following error occurs:
AttributeError: 'Molecule' object has no attribute 'components'
execute in the terminal:
pip install https://github.com/openforcefield/openforcefield/archive/toolkit-inheritance.zip --no-deps
This issue reminds me that OESpruce will not always build all missing sidechains, especially if all rotamers in its rotamer library lead to a clash with already existing protein atoms. A possible solution would be to integrate pdbfixer for those situations, which does not care about atom clashes at all.
We should add a feature to our training notebooks to check if an environment variable is set (e.g. KINOML_NUM_THREADS), and if so, call torch.set_num_threads() with the appropriate number of threads. This will allow us to work better within batch queue systems.
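A sketch of what that check could look like at the top of a training notebook (the helper name and variable default are assumptions; torch.set_num_threads is the real PyTorch call):

```python
import os


def threads_from_env(var="KINOML_NUM_THREADS"):
    """Return the requested thread count from the environment, or None.

    Hypothetical helper: an unset or empty variable means "leave the
    PyTorch default alone".
    """
    value = os.environ.get(var)
    return int(value) if value else None


n = threads_from_env()
if n is not None:
    # only touch torch when the batch system actually requested a count
    import torch
    torch.set_num_threads(n)
```

Batch schedulers can then export KINOML_NUM_THREADS per job without any notebook edits.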
Update tests, ideally reaching 80% coverage.
Note: have a check for the estimator input to the forward method in PyTorch models, see #53.
Noting this version-related issue for future posterity. Not necessarily an issue now, but could be in the future.
Running anything related to OESpruce in the following manner (as an example):
abl1_systems = abl1_featurizer.featurize(abl1_systems)
causes an ImportError in Python with the resulting stack trace (I'm only pasting the bottom of the trace due to length):
File ~/miniconda3/envs/kinoml/lib/python3.9/site-packages/openeye/libs/__init__.py:109, in OEGetModule(name)
106 spec = importlib.machinery.PathFinder().find_spec(name, [OPENEYE_DLLS])
108 # actually load the module
--> 109 mod = importlib.util.module_from_spec(spec)
110 spec.loader.exec_module(mod)
112 return mod
ImportError: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by /home/lauj2/miniconda3/envs/kinoml/lib/python3.9/site-packages/openeye/libs/python3-linux-x64-g++10.x/liboespruce-1.5.3.3.so)
Looks like the latest versions of openeye-toolkits require GLIBC >= 2.28. Currently, our GLIBC on Lilac is version 2.17.
This is resolved by installing openeye-toolkits versions 2021.2.0 up to 2023.1.1.
It may be worth pinning the installation versions for now in the yaml installation file?
cc: openkinome/experiments-binding-affinity#25
It appears that DatasetProvider.to_awkward() can create mixed-type union arrays when attempting to create a list of arrays for each feature, even if the features are all of homogeneous types themselves. We need to fix this code to change this behavior.
Converting our DatasetProviders to native torch.Dataset objects involves creating new torch.Tensors that can easily surpass available memory. There are some strategies we can investigate to robustly minimize this issue, mainly involving minibatches and compatible mechanisms. This requires DataLoader adapters, which would allow converting to tensors only on __getitem__ access.
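A sketch of such a lazy adapter. PyTorch map-style datasets only need __len__ and __getitem__, so the conversion can be deferred to item access; the class and parameter names below are hypothetical, not the KinoML API:

```python
class LazyFeaturizedDataset:
    """Illustrative adapter following the torch map-style Dataset protocol.

    Tensors are created one item at a time on __getitem__, so a DataLoader
    only ever materializes the current minibatch, never the full dataset.
    """

    def __init__(self, systems, to_tensor):
        self.systems = systems
        # to_tensor would typically be torch.as_tensor (or a featurizer
        # followed by it); injected here to keep the sketch torch-free
        self.to_tensor = to_tensor

    def __len__(self):
        return len(self.systems)

    def __getitem__(self, index):
        # conversion happens lazily, per item, on access
        return self.to_tensor(self.systems[index])
```

Wrapped in a torch.utils.data.DataLoader with a batch_size, this yields minibatch tensors on demand instead of one giant tensor up front.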
We should add code to our training notebooks to enable us to use CUDA for training if it is available.
From the espaloma example, we could add something like this:
if torch.cuda.is_available():
    espaloma_model = espaloma_model.cuda()
Let's start thinking of the requirements we will have. From our previous meeting on early July, we have these action items:
CC @schallerdavid @WG150
We will probably need a lazy featurization scheme like in the ChEMBL notebook, either in real time, or via pre-featurizing everything offline and then accessing the providers to get data from disk on demand.
The API docs are not up-to-date. How can we make sure they are on track with the latest commit on master?
How should we handle reading and writing, e.g. of molecules? Would it be good to have an io subpackage? Also, loop modeling and system preparation with the OpenEye toolkit will benefit from reading the original PDB file, since OpenEye uses information stored in the HEADER section. So we would need an OpenEye-specific io module.