Comments (18)
For example, a Fragmenter instance could memoize (remember) input molecules so it can look up their FragmentSet if it has already been fragmented.
links to the fragments which could be accessed as
mol1_fragments = factory.molecules[0].fragments
I do not think the factory should store the parent molecule and fragments
The object created by the factory, however, could contain the parent and daughter fragments in a convenient data object model, however---call it a FragmentResult or FragmentedMolecule.
In light of the planned integration of Fragmenter into OFFTK, I agree with @jchodera's last proposal about not keeping links between the original molecule and the fragmentation results. Keeping/caching those relationships would introduce all sorts of surprising behaviors and edge cases -- For example a user who wants to fragment one million molecules would get stuck with a huge memory footprint, even if they wrote out the fragments after processing each molecule. Also, since Molecule
s are to some extent mutable, the user could change the parent or fragments in several ways that could invalidate a cached result (for example, manually setting the WBOs on a parent molecule after fragmenting it once). My biggest lesson so far in this role is that users will explore literally every extreme case within weeks of a feature release, so it's best to provide them with simple functionality, where the solution to memory/processing limits are in their hands.
If we wanted to also bundle the results of fragmentation---including the computed WBOs for the parent and child molecules---we could do that through a FragmenterResult object:
I like this idea, with the added bonus that the FragmenterResult
could contain a mapping from the whole-molecule atom indices to the fragment's -- Especially helpful if we need to keep track of the 4 atoms involved in a torsion that we want to scan. This would avoid the potential memory explosion by having the FragmenterResult
object go out of scope if the user doesn't care about it.
from openff-fragmenter.
But do we really need to preserve this relationship by bundling everything in an object? What information would be useful to bundle together, and can we make a compelling case for why?
The only reason to maintain the relationship between the parent and daughter molecule is to know which torsions to drive for bespoke torsion fitting because the fragments are generated for the rotatable bond. For general forcefield parameterization, there is no need for that relationship because all torsions (or combination of torsions) will be driven. I like the idea of having the FragmenterResults
have the option to maintain this relationship.
from openff-fragmenter.
The only reason to maintain the relationship between the parent and daughter molecule is to know which torsions to drive for bespoke torsion fitting because the fragments are generated for the rotatable bond.
I'm not sure I get this. My understanding is that the input to the fragmenter algorithm includes the torsion to fragment around, which would mean we start with that information.
Or do you mean that this would be useful for a different API call in which you ask it to "fragment this whole molecule", which would then iteratively fragment around each rotatable bond? In that case, we'd like to know which fragments go with which bond.
from openff-fragmenter.
I've been playing around with this a bit more and it seems to make sense to make the fragmenter class inherit directly from a pydantic base model. In this way options can be easily changed like normal properties:
@SimonBoothroyd : This is great! I was just looking into this, and it sounds like this is the best way to go. Any class fields prepended with _
will not be serialized by default, so it should still be possible to use class fields for temporary or cached storage without issues but have the class fields directly be serialzed and still use a Pythonic factory idiom.
from openff-fragmenter.
This also would enable a bunch of easy optimizations.
For example, a Fragmenter
instance could memoize (remember) input molecules so it can look up their FragmentSet
if it has already been fragmented.
A future extension could also use multiprocessing
under the hood to fragment multiple molecules on multiple threads by adding support to Fragment.fragment()
that would accept multiple molecules, rather than just one. Or it could be parallelized across bonds in the same molecule.
from openff-fragmenter.
@jchodera @ChayaSt @j-wags I wanted to restart this conversation about the Fragmenter API ahead of the planned integration into the OFFtoolkit. Following the idea of rebasing the code to use the Molecule
class from the toolkit it might be nice to have a method like Molecule.wbo_fragment(kwargs)
, however, based on all of the key arguments listed above this function could have a lot of options which might awkward to change.
I also like the idea of the FragmenterFactory
class which could house all of the running options and remember fragments of molecules already processed with the same settings, multiprocessing
would also be a nice addition at least at the molecule level. The FragmenterFactory
could then keep all of the fragmented molecules with links to the fragments which could be accessed as mol1_fragments = factory.molecules[0].fragments
, also if the class was pickled we would still know which molecules the fragments are related to. We could also have some post fragmentation filtering options this way which could remove duplicates etc.
I am not fully sure yet of how Fragmenter
will be used so maybe this will make getting the fragments awkward, anything else we should consider?
from openff-fragmenter.
We want to avoid making the fragmentation scheme be a method of Molecule
because there is no unambiguous fragmentation scheme---it's still very much an active area of research. Molecule
should only contain convenience methods that are stable, clearly defined processes, like conversion to/from other formats or clear manipulation methods.
I think creating a Fragmenter
factory class is the right way to go. (It doesn't need to be called FragmenterFactory
since its API will make it clear it is a factory.) If we create a common Fragmenter
base class with a simple, clear API, we can have multiple implementations of this via subclasses (e.g. WBOFragmenter
) that can easily be swapped out.
I do not think the factory should store the parent molecule and fragments. That's not any sort of standard idiom or design pattern I'm familiar with. Instead, the factory should be something you create, configure, and use to create new objects that are returned by a factory method. The object created by the factory, however, could contain the parent and daughter fragments in a convenient data object model, however---call it a FragmentResult
or FragmentedMolecule
.
Writing out a few clear use cases is the easiest way to approach design.
For example, consider a fragmenter that just returns new Molecule
objects. We want to fragment a bunch of parent molecules and compile a unique set of fragments:
# Create a Fragmenter
fragmenter = WBOFragmenter(wbo_threshold=0.8, capping_rules='default')
# Further configure the fragmenter
fragmenter.set_wbo_threshold(0.7)
# Fragment a bunch of molecules
all_fragments = set()
for molecule in molecules:
fragments = fragmenter.fragment(molecule)
all_fragments.add(fragments)
That could be very useful on its own.
If we wanted to also bundle the results of fragmentation---including the computed WBOs for the parent and child molecules---we could do that through a FragmenterResult
object:
# Create a Fragmenter
fragmenter = WBOFragmenter(wbo_threshold=0.8, capping_rules='default')
# Further configure the fragmenter
fragmenter.set_wbo_threshold(0.7)
# Fragment a bunch of molecules
all_fragments = set()
for molecule in molecules:
fragmented_molecule = fragmenter.fragment(molecule)
all_fragments.add(fragmented_molecule.fragments)
# we can also inspect the parent molecule
parent = fragmented_molecule.parent_molecule
etc. But do we really need to preserve this relationship by bundling everything in an object? What information would be useful to bundle together, and can we make a compelling case for why?
Filtering can simply be done by using a set()
and adding molecules to the set. I believe we use the canonical isomeric SMILES for uniqueness checks, so no new machinery is needed to filter out duplicates.
from openff-fragmenter.
Are we set on integrating the Fragmenter
into OFFTK? I thought we had figured out a way to keep the fragmenter
package separate, since there is still a lot of science to be done there (even if we have v1 of a production algorithm ready now).
from openff-fragmenter.
@jchodera I recall that Fragmenter was planned for integration into OFFTK, but CMILES was slated to remain largely independent. I can dig up the meeting notes for that tonight if we'd like to reevaluate.
from openff-fragmenter.
I think it could still work do to this, provided we do so in a way that makes it easy to extend via subclassing and still use externally developed variants of Fragmenter
without modifying the code in openforcefield
.
from openff-fragmenter.
I think with the bespoke fitting for a molecule with multiple bonds users will want to fragment around all of them and refit the parameters which is why I want to return the relational information for each torsion. If this was not built into fragmenter it could instead be part of the bespoke fitting package which could make multiple calls to fragmenter and record the relation its self, but I think this functionality would make more sense in fragmenter.
from openff-fragmenter.
Currently, the torsion to fragment around is not part of fragmenter
s input. The first thing it does is find the rotatable bonds to fragment around and then maps those to the new fragment.
The issue is that the rotatable bonds are numbered with the map indices of the parent molecule but the daughter fragments need to have their own map indices. Another way to deal with this without keeping the parent - daughter relationship is to have fragmenter
find that torsion in the resulting fragment before returning the fragments.
This happens in the method that generates the torsiondrive inputs and it uses the parent map indices that are still on the fragment. The provenance still retains the parent molecule map indices because it is used when validating fragmentation schemes.
Validating fragmentation schemes is another reason to retain the parental relationship but since it will probably not happen very often it is not that important. The relationship is needed when validating because it makes it straight forward to find all fragments that have the bond of interest.
It can be a good idea to have the torsion of interest be part of the input because that will reduce the cost of fragmenting in general.
from openff-fragmenter.
That might be a nice extension to the API to have the option to fragment all or fragment around a list of user-supplied bonds. In the future, we may have some method of estimating the performance of a set of parameters applied to a molecule (for bespoke fitting), which could indicate some suspect torsions that should be refit. In this case, only fragmenting around these bonds would be more efficient.
So when fragmenting one molecule we could
# Create a Fragmenter
fragmenter = WBOFragmenter(wbo_threshold=0.8, capping_rules='default')
# Further configure the fragmenter
fragmenter.set_wbo_threshold(0.7)
# fragment our molecule around target bonds
target_bonds = [(0,1), (3,4)]
fragmented_molecules = fragmenter.fragment(molecule, target_bonds=target_bonds)
The default use of the fragmenter would then still fragment around all rotable bonds.
fragmented_molecules = fragmenter.fragment(molecule, target_bonds='all')
from openff-fragmenter.
Can you folks make some concrete proposals about the API you're envisioning here? Simple examples illustrating use (like my example above) make it much easier to see what you're proposing.
from openff-fragmenter.
My proposal for the API is:
parent_molecule = Molecule.from_smiles(
"OC1(CN(C1)C(=O)C1=C(NC2=C(F)C=C(I)C=C2)C(F)=C(F)C=C1)[C@@H]1CCCCN1"
)
fragmenter = WBOFragmenter(
keep_non_rotor_ring_substituents=False,
threshold=0.03,
heuristic="path_length"
)
fragmentation_result = fragmentater.fragment(parent_molecule)
print(fragmentation_result.fragments)
print(fragmentation_result.provenance)
The fragmenter settings can be changed on the object directly:
fragmenter.threshold = 0.07
fragmentation_result = fragmentater.fragment(parent_molecule)
The fragmentation engine will inherit from the pydantic
base model allowing the options it uses to be readily serialisable. This would then enable us to easily include the general schema for how a molecule should be fragmented in other workflow schemas (e.g. the bespoke fit optimization schema).
The WBOFragmenter
and similar would be entirely stateless and not keep track of individual molecules.
The output data models (again pydantic
based) would look something like:
class Fragment(BaseModel):
smiles: str # A mapped SMILES pattern describing the fragment. The map index of
# the fragment will match the corresponding map indices of the parent
bond: Tuple[int, int] # The MAP indices of the atoms involved in the rotatable
# bond that the fragment was built around.
@property
def molecule(self):
return Molecule.from_smiles(self.smiles)
class FragmentationResult(BaseModel):
parent_smiles: str # A mapped SMILES pattern describing the parent molecule
fragments: List[Fragment] # The generated fragments. This can't be a dict as
# JSON cannot handle tuple keys unfortunately.
@property
def parent_molecule(self):
return Molecule.from_smiles(self.parent_smiles)
A mapping between each fragment and the parent could then be generated by:
parent = fragmentation_result.parent_molecule
fragment = fragmentation_result.fragments[0].molecule
fragment_to_parent = {
fragment_index: get_atom_index(parent, map_index)
for fragment_index, map_index in fragment.properties["atom_map"].items()
}
from openff-fragmenter.
@SimonBoothroyd : I like the idea of using pydantic to be able to easily serialize class options, but how would one modify the WBOFragmenter
factory options after instantiation? Would they still be accessible via class fields? Or would it be something like
# Change an option after factory has been instantiated
fragmenter.options.threshold = 0.04
from openff-fragmenter.
@SimonBoothroyd : I like the idea of using pydantic to be able to easily serialize class options, but how would one modify the WBOFragmenter factory options after instantiation? Would they still be accessible via class fields? Or would it be something like
I've been playing around with this a bit more and it seems to make sense to make the fragmenter class inherit directly from a pydantic base model. In this way options can be easily changed like normal properties:
fragmenter = WBOFragmenter(threshold=0.03)
# Change an option after factory has been instantiated
fragmenter.threshold = 0.05
and the whole factory can be easily (de)serialized:
>>> fragmenter = WBOFragmenter()
>>> fragmenter.json()
{
"functional_groups": {
"hydrazine": "[NX3:1][NX3:2]",
...
"tri_halide": "[#6:1]([F,Cl,I,Br:2])([F,Cl,I,Br:3])([F,Cl,I,Br:4])"
},
"scheme": "WBO",
"wbo_options": {
"method": "am1-wiberg-elf10",
"max_conformers": 800,
"rms_threshold": 1.0
},
"threshold": 0.03,
"heuristic": "path_length",
"keep_non_rotor_ring_substituents": false
}
I've updated the above comment to reflect this proposed API.
from openff-fragmenter.
Great! I've implemented this now in #103 (example notebook here) if anyone has any additional feedback they'd like to get in.
from openff-fragmenter.
Related Issues (20)
- Grow fragments when stereochemistry can not be fixed
- fragmented molecule atom mapping rearrangement HOT 3
- Fragmentation issue because of stereochem when the bond involves Sulfur HOT 3
- `find_rotatable_bonds` fails unless an atom map is present HOT 1
- Provide a way to map from `WBOFragmenter` input molecule to parent molecule HOT 1
- Use xtb for the WBO HOT 2
- Skip OpenEye functions if license not found, even if installed
- Support openforcefield.topology.Molecule? HOT 2
- Functional groups on rings do not come along with ring HOT 1
- Handle different charging failures appropriately
- Charging molecules while only returning the original coordinates returns a wonky molecule HOT 1
- Return parent fragment mapping HOT 2
- Ideas for future refactor HOT 1
- Fragmentation fails with new openeye toolkit. HOT 2
- Fragments have missing stereochemistry HOT 2
- Segmentation Fault HOT 5
- Wiberg Order Not Found HOT 2
- Omega returned error code 0 HOT 1
- Refactor into an OpenFF namespace
- Make fragmenters stateless
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from openff-fragmenter.