openkinome / kinoml
Structure-informed machine learning for kinase modeling
Home Page: https://openkinome.org/kinoml/
License: MIT License
Have class-based documentation with hierarchy in the Sphinx docs. See this example:
https://open-forcefield-toolkit.readthedocs.io/en/0.10.0/topology.html
Source:
https://open-forcefield-toolkit.readthedocs.io/en/0.10.0/_sources/topology.md.txt
Currently, I am experimenting with providing a separate docking template to the OEPositDockingFeaturizer in the posit_template branch. This way, you could dock into a structure from PDB entry 4aoj but bias the Posit docking algorithm with the ligand co-crystallized in another PDB entry, e.g. 4yne.
I pass the required information as attributes to the corresponding ligand and protein instances of the protein-ligand complex (see code example below). The featurizer should be able to read these attributes to do a proper job. This process always worked fine when passing the attributes to the protein only in other featurizers. However, adding additional attributes to the ligand instance gives surprising results when using multiprocessing: anything other than the smiles and name attributes (given during initialization) is lost.
from kinoml.core.components import BaseProtein
from kinoml.core.ligands import Ligand
from kinoml.core.systems import ProteinLigandComplex
from kinoml.features.complexes import OEPositDockingFeaturizer

compounds = {
    "larotrectinib": "C1CC(N(C1)C2=NC3=C(C=NN3C=C2)NC(=O)N4CCC(C4)O)C5=C(C=CC(=C5)F)F",
    "selitrectinib": "CC1CCC2=C(C=C(C=N2)F)C3CCCN3C4=NC5=C(C=NN5C=C4)C(=O)N1",
}

systems = []
for name, smiles in compounds.items():
    protein = BaseProtein(name="NTRK1")
    protein.pdb_id = "4aoj"
    protein.expo_id = "V4Z"
    protein.chain_id = "A"
    ligand = Ligand.from_smiles(smiles=smiles, name=name)
    ligand.docking_template_pdb_id = "4yne"  # lost in multiprocessing
    ligand.docking_template_expo_id = "4EK"  # lost in multiprocessing
    ligand.docking_template_chain_id = "A"  # lost in multiprocessing
    systems.append(ProteinLigandComplex(components=[protein, ligand]))

featurizer = OEPositDockingFeaturizer(output_dir="posit", use_multiprocessing=True)
systems = featurizer.featurize(systems)
Just googling this behavior gave me a few hints. It looks like there may be a serialization problem. Interestingly, this is not a problem when using the RDKitLigand class instead of the Ligand class to store the attributes. Since the Ligand class is based on the _OpenForceFieldMolecule class, the problem may arise on their end.
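This would be consistent with how multiprocessing pickles objects to send them between processes: if a class customizes its serialization (e.g. via __reduce__) to rebuild instances from their constructor arguments only, any dynamically added attribute is silently dropped in the round-trip. A minimal, self-contained sketch of that failure mode (the toy Ligand class below is illustrative, not the real KinoML class):

```python
import pickle


class Ligand:
    """Toy stand-in for a molecule class whose custom pickling only
    round-trips the constructor arguments (illustrative, not KinoML)."""

    def __init__(self, smiles, name):
        self.smiles = smiles
        self.name = name

    def __reduce__(self):
        # only smiles and name survive pickling;
        # dynamically added attributes are dropped
        return (Ligand, (self.smiles, self.name))


ligand = Ligand("CCO", "ethanol")
ligand.docking_template_pdb_id = "4yne"  # dynamically added attribute

# a pickle round-trip, which is what multiprocessing does under the hood
clone = pickle.loads(pickle.dumps(ligand))
# clone.smiles is intact, but clone has no docking_template_pdb_id
```

If _OpenForceFieldMolecule implements something along these lines, that would explain why only smiles and name survive the worker processes.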
The recent changes in the featurization pipeline changed how the featurizers go through the different systems in a dataset.
Previously, a single system would go all the way through the pipeline (composed of N featurizers). This made it difficult to auto-detect global properties of the dataset (e.g. maximum length to pad bit-vectors to), so we refactored the pipeline so it traverses featurizers first.
# before
for system in systems:
    for featurizer in featurizers:
        featurizer.featurize(system)

# now
for featurizer in featurizers:
    featurizer.featurize(systems)
This, however, implies that ALL the artifacts created by each featurizer coexist in time for the full dataset, i.e. more memory usage. To give some numbers, ChEMBL28 (158K systems) peaks at around 6 GB of RAM, mainly due to all the RDKit molecules that will be created from the SMILES. We do clear the featurizations dictionary after each pass by default (a recent change), but I am writing this down as an issue because it might become a bottleneck for more complex schemes. These might require featurizing datasets in batches.
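A minimal sketch of what batched featurization could look like, so that only one batch's artifacts live in memory at a time (the `batched` helper and batch size are hypothetical, not part of the KinoML API):

```python
def batched(systems, batch_size):
    """Yield successive slices of the systems list.

    Hypothetical helper: featurizing batch by batch bounds peak memory
    to the artifacts of a single batch instead of the whole dataset.
    """
    for start in range(0, len(systems), batch_size):
        yield systems[start:start + batch_size]


# sketch of usage, assuming a featurizer as in the snippet above:
# for batch in batched(systems, 10_000):
#     featurizer.featurize(batch)
```

The trade-off is that dataset-global properties (e.g. maximum bit-vector length) would again need a separate pass or a prior estimate.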
This issue will track all the progress related to getting the core functionality for ligand-based models merged to master.
mkdocs for Sphinx + material theme?

[substrate]: we could estimate by cross-validation, use relative pIC50s, or add that as a nuisance parameter in the future.

The OEDocking functionality is working with deprecated receptor objects. We should soon move to design units.
Using OpenEye code without a license leads to dead kernels. It would be great if the example notebooks could also be run without a valid OpenEye license, or at least display a warning instead of a dying kernel without any further information.
I added an OpenEye license check in the Protein object initialization. However, this does not get used by the from_pdb and from_file methods. So one could add the checks there as well or find a more general solution.
In any case, if you stumble over dead kernels and don't have a valid OpenEye license, specify toolkit="MDAnalysis" when working with Protein objects and DataSetProviders.
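One possible shape for a more general solution is a single helper that both from_pdb and from_file could call before touching any OpenEye code. The helper name below is an assumption; OEChemIsLicensed is the actual OpenEye call:

```python
def openeye_is_licensed():
    """Return True only if the OpenEye toolkits import cleanly AND a
    valid license is found; never raises.

    Sketch of a hypothetical helper (not the KinoML API). The real
    OpenEye call is oechem.OEChemIsLicensed().
    """
    try:
        from openeye import oechem
    except ImportError:
        # toolkits not installed at all
        return False
    return bool(oechem.OEChemIsLicensed())
```

Callers could then warn and fall back to toolkit="MDAnalysis" instead of letting the kernel die.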
Some resources: sklearn / xgboost and convert to PyTorch: https://towardsdatascience.com/transform-your-ml-model-to-pytorch-with-hummingbird-da49665497e7 (https://github.com/microsoft/hummingbird)

In kinoml.features.core we implement a few abstract classes. According to Python best practices, we should include the functionality from the built-in abc library. However, this may also complicate testing methods defined in abstract classes.
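A sketch of what that could look like, including how a tiny concrete subclass keeps the shared logic testable (class names are illustrative, not the actual kinoml.features.core classes):

```python
from abc import ABC, abstractmethod


class BaseFeaturizer(ABC):
    """Illustrative abstract featurizer using the built-in abc machinery."""

    @abstractmethod
    def _featurize_one(self, system):
        """Subclasses must implement per-system featurization."""

    def featurize(self, systems):
        # concrete logic defined on the abstract class; abc makes the
        # class un-instantiable, so tests need a minimal subclass
        return [self._featurize_one(system) for system in systems]


class IdentityFeaturizer(BaseFeaturizer):
    """Minimal concrete subclass, handy for testing the shared logic."""

    def _featurize_one(self, system):
        return system
```

With abc, instantiating BaseFeaturizer directly raises TypeError, which is exactly the testing complication mentioned above: shared methods can only be exercised through a throwaway subclass like IdentityFeaturizer.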
It would be great if our CI can provide information about cell execution times when testing the example notebooks. We use the nbval plugin for pytest, which does not support this feature currently.
Our current cache implementations rely on functools.lru_cache decorators, which accept a maxsize keyword with the number of items to memoize. We can customize this (now hardcoded) value if we drop the decorator syntax sugar and decorate the memoized methods in __init__ "manually".
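A sketch of that "manual" decoration pattern, assuming a hypothetical featurizer class (not the actual KinoML API):

```python
from functools import lru_cache


class CachedFeaturizer:
    """Illustrative: decorate in __init__ so the cache size is configurable
    per instance instead of hardcoded at class-definition time."""

    def __init__(self, cache_size=128):
        # wrap the plain bound method with an lru_cache of the requested
        # size; each instance gets its own independent cache
        self.featurize_one = lru_cache(maxsize=cache_size)(self._featurize_one)

    def _featurize_one(self, smiles):
        # placeholder for the expensive work being memoized
        return len(smiles)
```

A side benefit: a per-instance cache avoids the well-known pitfall of lru_cache on methods keeping instances alive through the class-level cache.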
The example notebooks are quite out of date. We can integrate them in CI testing (PR #66).
Still needs to include the measurement type, external indices and, if needed, provenance information for each kinase / ligand.
Updating Biotite to 0.34.0 fixes the issue; mainly relevant for users with existing installs of the kinoml environment.
MolSSI is reaching out to every repository created from the MolSSI Cookiecutter-CMS with a .travis.yml file present to alert them to a potential security breach in using the Travis-CI service.
Between September 3 and September 10, 2021, the Secure Environment Variables Travis-CI uses were leaked for ALL projects and injected into the publicly available runtime logs. See more details here. All Travis-CI users should cycle any secure variables/files and associated objects as soon as possible. We are reaching out to our users as good stewards of the third-party products we recommended and that might still be in use, to provide a duty-to-warn to our end-users given the potential severity of the breach.
We at MolSSI recommend moving away from Travis-CI to another CI provider as soon as possible. Given the nature of this breach and the way the response was mishandled by Travis-CI, MolSSI cannot recommend the Travis-CI platform for any reason at this time. We suggest either GitHub Actions (as is used from v1.5 of the Cookiecutter-CMS) or some other service offered on GitHub.
If you have already addressed this security concern or it does not apply to you, feel free to close this issue.
This issue was created programmatically to reach as many potential end-users as possible. We do apologize if this was sent in error.
Currently, we run pylint with --disable=W because of E1102(not-callable) and E0401(import-error) errors. We should tackle those in the future.
We should add a CI badge so we can easily see and access the state of the nightly CI tests.
Currently, DataSetProviders cannot control multiprocessing of given Featurizers.
We have disabled Python 3.9 testing because:
import kinoml
We are using PyPI builds for PyTorch because conda-forge's won't work with torch-geometric. Until Geometric is on CF, there's no other way around!
We should include the data set which contains over 17k measurements of compounds on the CDK2 kinase.
It may be useful to include kincore (http://dunbrack.fccc.edu/kincore/) as a data provider and even use that to add the data as metadata for the Protein objects.
docking.OEDocking and features.complexes have not been tested yet
Add structure-based object model based on MDAnalysis to replace PDB artifacts.
conda env create --file devtools/conda-envs/test_env.yaml
conda activate test
pip install -e . --no-deps
If, while running the example notebook (see kinoml/examples/ChEMBL.ipynb), the following error occurs:
AttributeError: 'Molecule' object has no attribute 'components'
execute in the terminal:
pip install https://github.com/openforcefield/openforcefield/archive/toolkit-inheritance.zip --no-deps
This issue reminds me that OESpruce will not always build all missing sidechains, especially if all rotamers in its rotamer library lead to a clash with already existing protein atoms. A possible solution would be to integrate pdbfixer for those situations, which does not care about atom clashes at all.
We should add a feature to our training notebooks to check if an environment variable is set (e.g. KINOML_NUM_THREADS), and if so, call torch.set_num_threads() with the appropriate number of threads. This will allow us to work better within batch queue systems.
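A sketch of what that check could look like at the top of a training notebook (the helper name and variable default are assumptions; torch.set_num_threads is the real PyTorch call):

```python
import os


def threads_from_env(var="KINOML_NUM_THREADS"):
    """Return the requested thread count from the environment, or None.

    Hypothetical helper: an unset or empty variable means "leave the
    PyTorch default alone".
    """
    value = os.environ.get(var)
    return int(value) if value else None


n = threads_from_env()
if n is not None:
    # only touch torch when the batch system actually requested a count
    import torch
    torch.set_num_threads(n)
```

Batch schedulers can then export KINOML_NUM_THREADS per job without any notebook edits.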
Update tests, ideally reaching 80% coverage.
Note: have a check for the estimator input to the forward method in PyTorch models, see #53.
Noting this version-related issue for future posterity. Not necessarily an issue now, but could be in the future.
Running anything related to OESpruce in the following manner (as an example):
abl1_systems = abl1_featurizer.featurize(abl1_systems)
causes an ImportError in Python with the resulting stack trace (I'm only pasting the bottom of the trace due to length):
File ~/miniconda3/envs/kinoml/lib/python3.9/site-packages/openeye/libs/__init__.py:109, in OEGetModule(name)
106 spec = importlib.machinery.PathFinder().find_spec(name, [OPENEYE_DLLS])
108 # actually load the module
--> 109 mod = importlib.util.module_from_spec(spec)
110 spec.loader.exec_module(mod)
112 return mod
ImportError: /lib64/libc.so.6: version `GLIBC_2.28' not found (required by /home/lauj2/miniconda3/envs/kinoml/lib/python3.9/site-packages/openeye/libs/python3-linux-x64-g++10.x/liboespruce-1.5.3.3.so)
Looks like the latest versions of openeye-toolkits require GLIBC >= 2.28. Currently, our GLIBC on Lilac is version 2.17.
This is resolved by installing openeye-toolkits versions 2021.2.0 up to 2023.1.1.
It may be worth pinning the installation versions for now in the yaml installation file?
cc: openkinome/experiments-binding-affinity#25
It appears that DatasetProvider.to_awkward() can create mixed-type union arrays when attempting to create a list of arrays for each feature, even if the features are all of homogeneous types themselves. We need to fix this code to change this behavior.
Converting our DatasetProviders to native torch.Dataset objects involves creating new torch.Tensors that can easily surpass available memory. There are some strategies we can investigate to robustly minimize this issue, mainly involving minibatches and compatible mechanisms. This requires DataLoader adapters, which would allow converting to tensors only on __getitem__ access.
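A sketch of such a lazy adapter. PyTorch map-style datasets only need __len__ and __getitem__, so the conversion can be deferred to item access; the class and parameter names below are hypothetical, not the KinoML API:

```python
class LazyFeaturizedDataset:
    """Illustrative adapter following the torch map-style Dataset protocol.

    Tensors are created one item at a time on __getitem__, so a DataLoader
    only ever materializes the current minibatch, never the full dataset.
    """

    def __init__(self, systems, to_tensor):
        self.systems = systems
        # to_tensor would typically be torch.as_tensor (or a featurizer
        # followed by it); injected here to keep the sketch torch-free
        self.to_tensor = to_tensor

    def __len__(self):
        return len(self.systems)

    def __getitem__(self, index):
        # conversion happens lazily, per item, on access
        return self.to_tensor(self.systems[index])
```

Wrapped in a torch.utils.data.DataLoader with a batch_size, this yields minibatch tensors on demand instead of one giant tensor up front.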
We should add code to our training notebooks to enable us to use CUDA for training if it is available.
From the espaloma example, we could add something like this:
if torch.cuda.is_available():
    espaloma_model = espaloma_model.cuda()
Let's start thinking of the requirements we will have. From our previous meeting on early July, we have these action items:
CC @schallerdavid @WG150
We will probably need a lazy featurization scheme like in the ChEMBL notebook, either in real time, or via pre-featurizing everything offline and then accessing the providers to get data from disk on demand.
The API docs are not up-to-date. How can we make sure they are on track with the latest commit on master?
How should we handle reading and writing, e.g. of molecules? Would it be good to have an io subpackage? Also, loop modeling and system preparation with the OpenEye toolkit will benefit from reading the original PDB file, since OpenEye uses information stored in the HEADER section. So we would need an OpenEye-specific io module.